ABSTRACT

We study crowdsourcing quality management: given worker responses
to a set of tasks, our goal is to jointly estimate the true answers for
the tasks as well as the quality of the workers. Prior work on this
problem relies primarily on applying Expectation-Maximization (EM)
to the underlying maximum likelihood problem to estimate true
answers as well as worker quality. Unfortunately, EM only provides a
locally optimal solution rather than a globally optimal one. Other
solutions to the problem (that do not leverage EM) also fail to provide
global optimality guarantees.

In this paper, we focus on filtering, where tasks require the
evaluation of a yes/no predicate, and rating, where tasks elicit integer
scores from a finite domain. We design algorithms for finding the
globally optimal estimates of correct task answers and worker quality
for the underlying maximum likelihood problem, and characterize
the complexity of these algorithms. Our algorithms conceptually
consider all mappings from tasks to true answers (typically a very
large number), leveraging two key ideas to reduce, by several orders
of magnitude, the number of mappings under consideration, while
preserving optimality. We also demonstrate that these algorithms
often find more accurate estimates than EM-based algorithms. This
paper makes an important contribution towards understanding the
inherent complexity of globally optimal crowdsourcing quality
management.

1. INTRODUCTION 

Crowdsourcing [8] enables data scientists to collect human-labeled
data at scale for machine learning algorithms, including those
involving image, video, or text analysis. However, human workers
often make mistakes while answering these tasks. Thus, crowdsourcing
quality management, i.e., jointly estimating human worker quality as
well as answer quality (the probability of different answers for the
tasks), is essential. While knowing the answer quality helps us with
the set of tasks at hand, knowing the quality of workers helps us
estimate the true answers for future tasks and decide whether to
hire or fire specific workers.

In this paper, we focus on rating tasks, i.e., those where the
answer is one from a fixed set of ratings in {1, 2, ..., R}. This
includes, as a special case, filtering tasks, where the ratings are
binary, i.e., {0, 1}. Consider the following example: say a data
scientist intends to design a sentiment analysis algorithm for tweets.
To train such an algorithm, she needs a training dataset of tweets,
rated on sentiment. Each tweet needs to be rated on a scale of {1, 2, 3},
where 1 is negative, 2 is neutral, and 3 is positive. A natural way to
do this is to display each tweet, or item, to human workers hired via
a crowdsourcing marketplace like Amazon's Mechanical Turk [1],
and have workers rate each item on sentiment from 1 to 3. Since
workers may answer these rating tasks incorrectly, we may have
multiple workers rate each item. Our goal is then to jointly estimate
the sentiment of each tweet and the accuracy of the workers.

Standard techniques for solving this estimation problem typically
involve the use of Expectation-Maximization (EM). Applications of
EM, however, provide no theoretical guarantees. Furthermore, as we
will show in this paper, EM-based algorithms are highly dependent on
initialization parameters and can often get stuck in undesirable local
optima. Other techniques for optimal quality assurance, some specific
to only filtering [5, 9, 14], are not provably optimal either, in that
they only give bounds on the errors of their estimates, and do not
provide the globally optimal quality estimates. We cover other related
work in the next section.

In this paper, we present a technique for globally optimal quality
management, that is, finding the maximum likelihood item (tweet)
ratings and worker quality estimates. If we have 500 tweets and
3 possible ratings, the total number of mappings from tweets to
ratings is $3^{500}$. A straightforward technique for globally optimal
quality management is to simply consider all possible mappings,
and for each mapping, infer the overall likelihood of that mapping.
(It can be shown that the best worker error rates are easy to
determine once the mapping is fixed.) The mapping with the highest
likelihood is then the global optimum.

However, the number of mappings even in this simple example,
$3^{500}$, is very large, making this approach infeasible. Now,
for illustration, let us assume that workers are indistinguishable,
and they all have the same quality (which is unknown). It is well
understood that, at least on Mechanical Turk, the worker pool is
constantly in flux, and it is often hard to find workers who have
attempted enough tasks to get robust estimates of worker
quality. (Our techniques also apply to a generalization of this case.)

To reduce this exponential complexity, we use two simple but
powerful ideas to greatly prune the set of mappings that need to
be considered, from $3^{500}$ to a much more manageable number.
Suppose we have 3 ratings for each tweet; a common strategy in
crowdsourcing is to get a small, fixed number of answers for each
question. First, we hash “similar” tweets that receive the same set
of worker ratings into a common bucket. As shown in Figure 1,
suppose that 300 items each receive three ratings of 3 (positive),
100 items each receive one rating of 1, one rating of 2, and one
rating of 3, and 100 items each receive three ratings of 1. That is,
we have three buckets of items, corresponding to the worker answer
sets $B_1 = \{3,3,3\}$, $B_2 = \{1,2,3\}$, and $B_3 = \{1,1,1\}$.
We now exploit the intuition that if two items receive the same set
of worker responses they should be treated identically. We therefore
only consider mappings that assign the same rating to all items
(tweets) within a bucket. Now, since in our example we only have
3 buckets, each of which can be assigned a rating of 1, 2, or 3, we
are left with just $3^3 = 27$ mappings to consider.

Next, we impose a partial ordering on the set of buckets based
on our belief that items in certain buckets should receive a higher
final rating than items in other buckets. Our intuition is simple: if
an item received “higher” worker ratings than another item, then
its final assigned rating should be higher as well. In this example,
our partial ordering, or dominance ordering, on the buckets is
$B_1 > B_2 > B_3$; that is, intuitively, items which received all three
worker ratings of 3 should not have a true rating smaller than items in
the second or third buckets, where items receive lower ratings. This
means that we can further reduce our space of 27 remaining mappings
by removing all those mappings that do not respect this partial
ordering. The number of such remaining mappings is 10, corresponding
to when all items in the buckets $(B_1, B_2, B_3)$ are mapped
respectively to ratings (3,3,3), (3,3,2), (3,3,1), (3,2,2), (3,2,1),
(3,1,1), (2,2,2), (2,2,1), (2,1,1), and (1,1,1).
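
The counts in this example can be checked with a few lines of Python. This is only an illustrative sketch, not the algorithm developed later in the paper: the names `bucket_mappings` and `respects_dominance` are hypothetical, and the dominance ordering is encoded as "ratings must be non-increasing from the first bucket to the third".

```python
from itertools import product

RATINGS = (1, 2, 3)
NUM_BUCKETS = 3  # buckets listed from most to least dominant

def bucket_mappings():
    """All assignments of one rating per bucket (bucketizing only)."""
    return list(product(RATINGS, repeat=NUM_BUCKETS))

def respects_dominance(mapping):
    """True if a dominating bucket never gets a lower rating than a dominated one."""
    return all(a >= b for a, b in zip(mapping, mapping[1:]))

all_mappings = bucket_mappings()
consistent = [m for m in all_mappings if respects_dominance(m)]
print(len(all_mappings), len(consistent))
```

Under these assumptions the script reports 27 bucket-level mappings, of which 10 respect the ordering, matching the list above.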

In this paper, we formally show that restricting the mappings in
this way does not take away from the optimality of the solution;
i.e., there exists a mapping with the highest likelihood that assigns
the same rating to all items with the same scores, and at the same
time obeys the dominance ordering relationship described above.

Our contributions are as follows:

• We develop an intuitive algorithm, based on simple but key
insights, that finds a provably optimal maximum likelihood solution
to the problem of jointly estimating true item labels and
worker error behavior for crowdsourced filtering and rating tasks.
Our approach involves reducing the space of potential ground
truth mappings while preserving optimality, enabling an exhaustive
search on an otherwise prohibitive domain.

• Although we primarily focus on and initially derive our optimality
results for the setting where workers independently draw
their responses from a common error distribution, we also propose
generalizations to harder settings, for instance, when workers
are known to come from distinct classes with separate error
distributions. That said, even the former setting is commonly
used in practice, and represents a significant first step towards
understanding the nature and complexity of exact globally optimal
solutions for this joint estimation problem.

• We perform experiments on synthetic and real datasets to evaluate
the performance of our algorithm on a variety of different
metrics. Though we optimize for likelihood, we also test the
accuracy of predicted item labels and worker response distributions.
We show that our algorithm also does well on these other
metrics. We test our algorithm on a real dataset where our
assumptions about the worker model do not necessarily hold, and
show that our algorithm still yields good results.

1.1 Related Literature 

Crowdsourcing is gaining importance as a platform for a variety
of different applications where automated machine learning
techniques don't always perform well, e.g., filtering [?] or labeling
[18, 20, 24] of text, images, or video, and entity resolution
[2, 22, 23, 25]. One crucial problem in crowdsourced applications is
that of worker quality: since human workers often make mistakes,
it is important to model and characterize their behavior in order to
aggregate high-quality answers.

EM-based joint estimation techniques. We study the particular
problem of jointly estimating hidden values (item ratings) and a
related latent set of parameters (worker error rates) given a set of
observed data (worker responses). A standard machine learning
technique for estimating parameters with unobserved latent variables
is Expectation-Maximization [10]. There has been significant
work in using EM-based techniques to estimate true item values
and worker error rates, such as [7, 21, 26], and subsequent
modifications using Bayesian techniques [3, 16]. In [19], the authors use
a supervised learning approach to learn a classifier and the ground
truth labels simultaneously. In general, these machine learning
based techniques only provide probabilistic guarantees and cannot
ensure optimality of the estimates. We solve the problem of finding
a global, provably maximum likelihood solution for both the
item values and the worker error rates. That said, our worker error
model is simpler than the models considered in these papers; in
particular, we do not consider worker identities or difficulties of
individual items. While we do provide generalizations to our approach
that relax some of these assumptions, they can be inefficient
in practice. However, we study this simpler model in more
depth, providing optimality guarantees. Since our work is the first
to provide optimality guarantees (even for a restricted setting),
it represents an important step forward in our understanding
of crowdsourcing quality management. Furthermore, anecdotally,
even the simpler model is commonly used for platforms like
Mechanical Turk, where the workers are fleeting.

Other techniques with no guarantees. There has been some work
that adapts techniques different from EM to solve the problem of
worker quality estimation. For instance, Chen et al. [4] adopt
approximate Markov Decision Processes to perform simultaneous
worker quality estimation and budget allocation. Liu et al. [?]
use variational inference for worker quality management on filtering
tasks (in our case, our techniques apply to both filtering and
rating). Like EM-based techniques, these papers do not provide
any theoretical guarantees.

Weaker guarantees. There has been a lot of recent work on providing
partial probabilistic guarantees or asymptotic guarantees on
accuracies of answers or worker estimates, for various problem
settings and assumptions, and using various techniques. We first
describe the problem settings adopted by these papers, then their
solution techniques, and then their partial guarantees.

The most general problem setting adopted by these papers is
identical to ours (i.e., rating tasks with arbitrary bipartite graphs
connecting workers and tasks) [27]; most papers focus only on
filtering [?] or operate only when the graph is assumed to be
randomly generated [13, 14]. Furthermore, most of these papers
assume that the false positive and false negative rates are the same.

The papers draw from various techniques, including just spectral
methods [5, 9], just message passing [?], a combination of spectral
methods and message passing [?], or a combination of spectral
methods and EM [27].

In terms of guarantees, most of the papers provide probabilistic
bounds [5, 13, 14, 27], while some only provide asymptotic
bounds [9]. For example, Dalvi et al. [5], which represents the
state of the art over multiple papers [9, 13, 14], show that under
certain assumptions about the graph structure (depending on the
eigenvalues) the error in their estimates of worker quality is lower
than some quantity with probability greater than $1 - \delta$.

Thus, overall, all the work discussed so far provides probabilistic
guarantees on item value predictions, and error bound guarantees
on estimated worker qualities. In contrast, we consider
the problem of finding a global maximum likelihood estimate for
the correct answers to tasks and the worker error rates.

Other related papers. Joglekar et al. [?] consider the problem
of finding confidence bounds on worker error rates. Our paper is
complementary to theirs in that, while they solve the problem of
obtaining confidence bounds on the worker error rates, we consider
the problem of finding the maximum likelihood estimates of the
item ground truth and worker error rates.

Zhou et al. [28, 29] use minimax entropy to perform worker quality
estimation as well as inherent item difficulty estimation; here the
inherent item difficulty is represented as a vector. Their technique
only applies when the number of workers attempting each task is
very large; here, overfitting (given the large number of hidden
parameters) is no longer a concern. For cases where the number of
workers attempting each task is on the order of 50 or 100 (highly
unrealistic in practical applications), the authors demonstrate that
the scheme outperforms vanilla EM.

Summary. At the highest level, our work differs from
all previous work in its focus on finding a globally optimal solution
to the maximum likelihood problem. We focus on a simpler
setting, but do so in more depth, representing significant progress
in our understanding of global optimality. Our globally optimal
solution uses simple and intuitive insights to reduce the search space
of possible ground truths, enabling exhaustive evaluation. Our general
framework leaves room for further study and has the potential
for more sophisticated algorithms that build on our reduced space.

2. PRELIMINARIES 

We start by introducing some notation, and then describe the 
general problem that we study in this paper; specific variants will 
be considered in subsequent sections. 

Items and Rating Questions. We let $\mathcal{I}$ be a set of
$|\mathcal{I}| = n$ items. Items could be, for example, images, videos,
or pieces of text.

Given an item $I \in \mathcal{I}$, we can ask a worker $w$ to answer a
rating question on that item. That is, we ask a worker: What is your
rating $r$ for the item $I$? We allow workers to rate the item with any
value in $\{1, 2, \ldots, R\}$.

Example 2.1. Recall our example application from Section 1,
where we have $R = 3$ and workers can rate tweets as being negative
($r = 1$), neutral ($r = 2$), or positive ($r = 3$). Suppose we have
two items, $\mathcal{I} = \{I_1, I_2\}$, where $I_1$ is positive, or has a
true rating of 3, and $I_2$ is neutral, or has a true rating of 2.

Response Set. We assume that each item $I$ is shown to $m$ arbitrary
workers, and therefore receives $m$ ratings in $[1, R]$. We denote
the set of ratings given by workers for item $I$ as $M(I)$ and write
$M(I) = (v_R, v_{R-1}, \ldots, v_1)$ if item $I \in \mathcal{I}$ receives
$v_i$ responses of rating “i” across workers, $1 \le i \le R$. Thus,
$\sum_{i=1}^{R} v_i = m$. We call $M(I)$ the response set of $I$, and $M$
the worker response set in general.

Continuing with Example 2.1, suppose we have $m = 2$ workers
rating each item on the scale of {1, 2, 3}. Let $I_1$ receive one
worker response of 2 and one worker response of 3. Then, we
write $M(I_1) = (1, 1, 0)$. Similarly, if $I_2$ receives one response of
3 and one response of 1, then we have $M(I_2) = (1, 0, 1)$.


Modeling Worker Errors. We assume that every item $I \in \mathcal{I}$ has
a true rating in $[1, R]$ that is not known to us in advance. What
we can do is estimate the true rating using the worker response
set. To estimate the true rating, we need to be able to estimate the
probabilities of worker errors.

We assume every worker draws their responses from a common
(discrete) response probability matrix, $p$, of size $R \times R$. Thus,
$p(i, j)$ is the probability that a worker rates an item with true value
$j$ as having rating $i$. Consider the following response probability
matrix of the workers described in our example ($R = 3$):



$$p = \begin{pmatrix} 0.7 & 0.1 & 0.2 \\ 0.2 & 0.8 & 0.2 \\ 0.1 & 0.1 & 0.6 \end{pmatrix}$$


Here, the $j$-th column represents the different probabilities of worker
responses when an item's true rating is $j$. Correspondingly, the $i$-th
row represents the probabilities that a worker will rate an item as
$i$. We have $p(1,1) = 0.7$, $p(2,1) = 0.2$, $p(3,1) = 0.1$, meaning
that given an item whose true rating is 1, workers will rate the item
correctly with probability 0.7, give it a rating of 2 with probability
0.2, and give it a rating of 3 with probability 0.1. The matrix $p$ is
in general not known to us. We aim to estimate both $p$ and the true
ratings of items in $\mathcal{I}$ as part of our computation.

Note that we assume that every response to a rating question
returned by every worker is independently and identically drawn
from this matrix: thus, each worker responds to each rating question
independently of other questions they may have answered, and of
other ratings for the same question given by other workers;
furthermore, all the workers have the same response matrix. In our
example, we assume that all four responses (2 responses to each
of $I_1$, $I_2$) are drawn from this distribution. We recognize that
assuming the same response matrix is somewhat stringent; we will
consider generalizations to the case where we can categorize workers
into classes (each with the same response matrix) in Section 5.
That said, while our techniques can still be applied when
there are a large number of workers or worker classes with distinct
response matrices, doing so may be impractical. Since our focus is on
understanding the theoretical limits of global optimality for a simple
case, we defer to future work the full generalization of our techniques
to workers with distinct response matrices, or to settings where worker
answers are not independent of each other.

Mapping and Likelihood. We call a function
$f : \mathcal{I} \to \{1, 2, \ldots, R\}$ that assigns ratings to items a
mapping. The set of actual ratings of items is also a mapping. We call
that the ground truth mapping, $T$. For Example 2.1, $T(I_1) = 3$,
$T(I_2) = 2$.

Our goal is to find the most likely mapping and worker response
matrix given the response set $M$. We let the probability of a specific
mapping, $f$, being the ground truth mapping, given the worker
response set $M$ and response probability matrix $p$, be denoted by
$\Pr(f \mid M, p)$. Using Bayes' rule, we have
$\Pr(f \mid M, p) = k \Pr(M \mid f, p)$, where $\Pr(M \mid f, p)$ is the
probability of seeing worker response set $M$ given that $f$ is the
ground truth mapping and $p$ is the true worker error matrix. Here, $k$
is the constant given by $k = \frac{\Pr(f)}{\Pr(M)}$, where $\Pr(f)$ is
the (constant) a priori probability of $f$ being the ground truth
mapping and $\Pr(M)$ is the (constant) a priori probability of seeing
worker response set $M$. Thus, $\Pr(M \mid f, p)$ is the probability of
workers providing the responses in $M$, had $f$ been the ground truth
mapping and $p$ been the true worker error matrix. We call this value
the likelihood of the mapping-matrix pair $(f, p)$.

Symbol                 Explanation
$\mathcal{I}$          Set of items
$M$                    Items-workers response set
$f$                    Items-values mapping
$p$                    Worker response probability matrix
$\Pr(M \mid f, p)$     Likelihood of $(f, p)$
$m$                    Number of worker responses per item
$T$                    Ground truth mapping

Table 1: Notation

We illustrate this concept on our example. We have
$M(I_1) = (1, 1, 0)$ and $M(I_2) = (1, 0, 1)$. Let us compute the
likelihood of the pair $(f, p)$ when $f(I_1) = 2$, $f(I_2) = 2$, and $p$ is
the matrix displayed above. We have

$$\Pr(M \mid f, p) = \Pr(M(I_1) \mid f, p) \, \Pr(M(I_2) \mid f, p)$$

assuming that rating questions on items are answered independently.
The quantity $\Pr(M(I_1) \mid f, p)$ is the probability that workers
drawing their responses from $p$ respond with $M(I_1)$ to an item with
true rating $f(I_1)$. Again, assuming independence of worker
responses, this quantity can be written as the product of the
probabilities of seeing each of the responses that $I_1$ receives. If $f$ is
given as the ground truth mapping, we know that the probability
of receiving a response of $i$ is $p(i, f(I_1)) = p(i, 2)$. Therefore,
the probability of seeing $M(I_1) = (1, 1, 0)$, that is, one response of 3
and one response of 2, is $p(3,2) \, p(2,2) = 0.1 \times 0.8 = 0.08$.
Similarly, $\Pr(M(I_2) \mid f, p) = p(3,2) \times p(1,2) = 0.1 \times 0.1 = 0.01$.
Combining all of these expressions, we have
$\Pr(M \mid f, p) = 0.01 \times 0.08 = 8 \times 10^{-4}$. Thus, our goal can
be restated as:
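
The worked example above can be reproduced with a short Python sketch. It is illustrative only: the function names `item_likelihood` and `likelihood` and the dictionary layout are assumptions made here, with response sets stored as $(v_R, \ldots, v_1)$ and the matrix indexed so that `P[i-1][j-1]` equals $p(i, j)$.

```python
from math import prod

# P[i-1][j-1] = p(i, j): probability of answering i when the true rating is j.
P = [
    [0.7, 0.1, 0.2],
    [0.2, 0.8, 0.2],
    [0.1, 0.1, 0.6],
]

def item_likelihood(response_set, rating, p):
    """Pr(M(I) | f, p) for one item, where response_set = (v_R, ..., v_1)."""
    counts = response_set[::-1]  # counts[i-1] = v_i
    return prod(p[i][rating - 1] ** counts[i] for i in range(len(p)))

def likelihood(responses, mapping, p):
    """Pr(M | f, p): product over items, assuming independent responses."""
    return prod(item_likelihood(responses[item], mapping[item], p)
                for item in responses)

M = {"I1": (1, 1, 0), "I2": (1, 0, 1)}
f = {"I1": 2, "I2": 2}
example_likelihood = likelihood(M, f, P)
print(example_likelihood)
```

Up to floating-point rounding, the printed value is $8 \times 10^{-4}$, the product of the per-item probabilities 0.08 and 0.01.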

Problem 2.1 (Maximum Likelihood Problem). Given $M$ and
$\mathcal{I}$, find

$$\arg\max_{f, p} \Pr(M \mid f, p)$$

A naive solution would be to look at every possible mapping $f'$,
compute $p' = \arg\max_{p} \Pr(M \mid f', p)$, and choose the $f'$
maximizing the likelihood value $\Pr(M \mid f', p')$. The number of such
mappings, $R^{|\mathcal{I}|}$, is however exponentially large.

We list our notation in Table 1 for ready reference.

3. FILTERING PROBLEM 

Filtering can be regarded as a special case of rating where $R = 2$.
We discuss it separately first, because its analysis is significantly
simpler and at the same time provides useful insights that we then
build upon for the generalization to rating, that is, to the case where
$R > 2$. For example, consider the filtering task of finding all images
of Barack Obama from a given set of images. For each image,
we ask workers the question “is this a picture of Barack Obama”.
Images correspond to items, and the question “is this a picture of
Barack Obama” corresponds to the filtering task on each item. We
can represent an answer of “no” to the question above by a score of 0,
and an answer of “yes” by a score of 1. Each item $I \in \mathcal{I}$ now has
an inherent true value in $\{0, 1\}$, where a true value of 1 means that
the item is one that satisfies the filter; in this case, the image is one
of Barack Obama. Mappings here are functions
$f : \mathcal{I} \to \{0, 1\}$.

Next, we formalize the filtering problem in Section 3.1 and describe
our algorithm in Section 3.2. The sections that follow prove a maximum
likelihood result and evaluate our algorithm.

3.1 Formalization 

Given the response set $M$, we wish to find the maximum likelihood
mapping $f : \mathcal{I} \to \{0, 1\}$ and $2 \times 2$ response probability
matrix, $p$. For the filtering problem, each item has an inherent true
value of either 0 or 1, and sees $m$ responses of 0 or 1 from different
workers. If item $I$ receives $m - j$ responses of 1 and $j$
responses of 0, we can represent its response set with the tuple or
pair $M(I) = (m - j, j)$.

Consider a worker response probability matrix of

$$p = \begin{pmatrix} 0.7 & 0.2 \\ 0.3 & 0.8 \end{pmatrix}$$

The first column represents the probabilities of worker responses
when an item's true rating is 0 and the second column represents
probabilities when an item's true rating is 1. Given that all workers
have the same response probabilities, we can characterize their
response matrix by just the corresponding worker false positive (FP)
and false negative (FN) rates, $e_0$ and $e_1$. That is, $e_0 = p(1, 0)$ is
the probability that a worker responds 1 to an item whose true value
is 0, and $e_1 = p(0, 1)$ is the probability that a worker responds 0
to an item whose true value is 1. We have $p(1, 1) = 1 - e_1$ and
$p(0, 0) = 1 - e_0$. Here, we can describe the entire matrix $p$ with
just the two values, $e_0 = 0.3$ and $e_1 = 0.2$.

Filtering Estimation Problem. Let $M$ be the observed response
set on item-set $\mathcal{I}$. Our goal is to find

$$f^*, e_0^*, e_1^* = \arg\max_{f, e_0, e_1} \Pr(M \mid f, e_0, e_1)$$

Here, $\Pr(M \mid f, e_0, e_1)$ is the probability of getting the response set
$M$, given that $f$ is the ground truth mapping and the true worker
response matrix is defined by $e_0, e_1$.

Dependence of Response Probability Matrices on Mappings.

Due to the probabilistic nature of our workers, for a fixed ground
truth mapping $T$, different worker error rates $e_0$ and $e_1$ can
produce the same response set $M$. These different worker error rates,
however, have varying likelihoods of occurrence. This leads us to
observe that worker error rates $(e_0, e_1)$ and mapping functions ($f$)
are not independent and are related through any given $M$. In fact,
we show that for the maximum likelihood estimation problem, fixing
a mapping $f$ enforces a maximum likelihood choice of $e_0, e_1$.
We leverage this fact to simplify our problem from searching for
the maximum likelihood tuple $f, e_0, e_1$ to just searching for the
maximum likelihood mapping, $f$. Given a response set $M$ and a
mapping $f$, we call this maximum likelihood choice of $e_0, e_1$
the parameter set of $f, M$, and represent it as $\mathrm{Params}(f, M)$. The
choice of $\mathrm{Params}(f, M)$ is intuitive and simple. We show that
we can estimate $e_0$ as the fraction of times a worker disagreed
with $f$ on an item $I$ in $M$ when $f(I) = 0$, and correspondingly, $e_1$
as the fraction of times a worker responded 0 to an item $I$ when
$f(I) = 1$. Under this constraint, we can prove that our original
estimation problem,

$$\arg\max_{f, e_0, e_1} \Pr(M \mid f, e_0, e_1)$$

simplifies to that of finding

$$\arg\max_{f} \Pr(M \mid f, e_0^*, e_1^*)$$

where $e_0^*, e_1^*$ are the constants given by $\mathrm{Params}(f, M)$.

Example 3.1. Suppose we are given 4 items
$\mathcal{I} = \{I_1, I_2, I_3, I_4\}$ with ground truth mapping
$T = (T(I_1), T(I_2), T(I_3), T(I_4)) = (1, 0, 1, 1)$.

Suppose we ask $m = 3$ workers to evaluate each item and receive
the following number of (“1”, “0”) responses for each respective
item: $M(I_1) = (3, 0)$, $M(I_2) = (1, 2)$, $M(I_3) = (2, 1)$,
$M(I_4) = (2, 1)$. Then, we can evaluate our worker false positive and
false negative rates as described above. Following the definitions
$e_0 = p(1, 0)$ and $e_1 = p(0, 1)$, we obtain
$e_1 = \frac{0 + 1 + 1}{3 + 3 + 3} = \frac{2}{9}$ (from items $I_1$, $I_3$, and
$I_4$, whose true value is 1) and $e_0 = \frac{1}{3}$ (from $I_2$, whose
true value is 0).
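
As a minimal sketch of the fraction-of-disagreements rule behind $\mathrm{Params}(f, M)$, the Python below applies the definitions $e_0 = p(1, 0)$ and $e_1 = p(0, 1)$ to the data of Example 3.1. The function name `params`, the dictionary layout, and the choice to return `None` when no item is mapped to a given value are assumptions made for this illustration, not details taken from the paper.

```python
from fractions import Fraction

def params(mapping, responses):
    """Return (e0, e1) for a 0/1 mapping.

    responses[item] = (number of "1" answers, number of "0" answers).
    e0: fraction of "1" answers among items mapped to 0 (false positives).
    e1: fraction of "0" answers among items mapped to 1 (false negatives).
    """
    false_pos = total_0 = false_neg = total_1 = 0
    for item, (ones, zeros) in responses.items():
        if mapping[item] == 0:
            false_pos += ones
            total_0 += ones + zeros
        else:
            false_neg += zeros
            total_1 += ones + zeros
    # Assumption: a rate is undefined (None) if no item carries that value.
    e0 = Fraction(false_pos, total_0) if total_0 else None
    e1 = Fraction(false_neg, total_1) if total_1 else None
    return e0, e1

M = {"I1": (3, 0), "I2": (1, 2), "I3": (2, 1), "I4": (2, 1)}
T = {"I1": 1, "I2": 0, "I3": 1, "I4": 1}
e0, e1 = params(T, M)
print(e0, e1)
```

For the ground truth mapping of Example 3.1 this yields a false positive rate of 1/3 and a false negative rate of 2/9.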

We shall henceforth refer to
$\Pr(M \mid f) = \Pr(M \mid f, \mathrm{Params}(f, M))$
as the likelihood of a mapping $f$. For now, we focus on the problem
of finding the maximum likelihood mapping, with the understanding
that finding the error rates is straightforward once the mapping
is fixed. In Section 3.4, we formally show that the problem of
jointly finding the maximum likelihood response matrix and mapping
can be solved by just finding the most likely mapping $f^*$. The
most likely triple $f, e_0, e_1$ is then given by
$f^*, \mathrm{Params}(f^*, M)$.

It is easy to calculate the likelihood of a given mapping. We have

$$\Pr(M \mid f) = \prod_{I \in \mathcal{I}} \Pr(M(I) \mid f, e_0, e_1),$$

where $e_0, e_1 = \mathrm{Params}(f, M)$ and $\Pr(M(I) \mid f, e_0, e_1)$ is the
probability of seeing the response set $M(I)$ on an item
$I \in \mathcal{I}$. Say $M(I) = (m - j, j)$. Then, we have

$$\Pr(M(I) \mid f) = \begin{cases} (1 - e_1)^{m-j} \, e_1^{j} & \text{for } f(I) = 1 \\ e_0^{m-j} \, (1 - e_0)^{j} & \text{for } f(I) = 0 \end{cases}$$

This can be evaluated in $O(m)$ for each $I \in \mathcal{I}$ by doing one pass
over $M(I)$. Thus, $\Pr(M \mid f) = \prod_{I \in \mathcal{I}} \Pr(M(I) \mid f)$
can be evaluated in $O(m |\mathcal{I}|)$. We use this as a building block in
our algorithm below.
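
The per-item formula and the product over items can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the names `item_likelihood` and `mapping_likelihood` are hypothetical, response sets are stored as (number of “1” answers, number of “0” answers), and the error rates 1/3 and 2/9 are the values obtained by applying the fraction-of-disagreements rule to the data of Example 3.1.

```python
from fractions import Fraction

def item_likelihood(ones, zeros, value, e0, e1):
    """Pr(M(I) | f) for one item with response set (ones, zeros)."""
    if value == 1:
        # "1" answers are correct, "0" answers are false negatives.
        return (1 - e1) ** ones * e1 ** zeros
    # "1" answers are false positives, "0" answers are correct.
    return e0 ** ones * (1 - e0) ** zeros

def mapping_likelihood(mapping, responses, e0, e1):
    """Pr(M | f): product of the per-item likelihoods."""
    result = Fraction(1)
    for item, (ones, zeros) in responses.items():
        result *= item_likelihood(ones, zeros, mapping[item], e0, e1)
    return result

M = {"I1": (3, 0), "I2": (1, 2), "I3": (2, 1), "I4": (2, 1)}
T = {"I1": 1, "I2": 0, "I3": 1, "I4": 1}
E0, E1 = Fraction(1, 3), Fraction(2, 9)  # assumed Params(T, M)
L = mapping_likelihood(T, M, E0, E1)
print(L, float(L))
```

Exact fractions are used here only to make the result easy to check by hand; floating-point numbers (or log-likelihoods, for large item sets) would serve equally well.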


3.2 Globally Optimal Algorithm 

In this section, we describe our algorithm for finding the maximum
likelihood mapping, given a response set $M$ on an item set
$\mathcal{I}$. A naive algorithm could be to scan all possible mappings $f$,
calculating for each $e_0^*, e_1^* = \mathrm{Params}(f, M)$ and the likelihood
$\Pr(M \mid f, e_0^*, e_1^*)$. The number of all possible mappings is,
however, exponential in the number of items. Given $n = |\mathcal{I}|$ items,
we can assign a value of either 0 or 1 to any of them, giving rise to
a total of $2^n$ different mappings. This makes the naive algorithm
prohibitively expensive.

Our algorithm is essentially a pruning-based method that uses
two simple insights (described below) to narrow the search for the
maximum likelihood mapping. Starting with the entire set of $2^n$
possible mappings, we eliminate all those that do not satisfy one
of our two requirements, and reduce the space of mappings to be
considered to $O(m)$, where $m$ is the number of worker responses
per item. We then show that an exhaustive evaluation on this
small set of remaining mappings is still sufficient to find a global
maximum likelihood mapping.

We illustrate our ideas on the example from Section 3.1, represented
graphically in Figure 2. We will explain this figure below.

Bucketizing. Since we assume (for now) that all workers draw
their responses from the same probability matrix $p$ (i.e., have the
same $e_0, e_1$ values), we observe that items with the exact same set
of worker responses can be treated identically. This allows us to
bucket items based on their observed response sets. Given that there
are $m$ worker responses for each item, we have $m + 1$ buckets,
starting from $m$ “1” and zero “0” responses, down to zero “1” and
$m$ “0” responses. We represent these buckets in Figure 2. The
x-axis represents the number of 1 responses an item receives and the
y-axis represents the number of 0 responses an item receives. Since
every item receives exactly $m$ responses, all possible response sets
lie along the line $x + y = m$. We hash items into the buckets
corresponding to their observed response sets. Intuitively, since all
items within a bucket receive the same set of responses and are for
all purposes identical, two items within a bucket should receive the
same value. It is more reasonable to give both items a value of 1 or
0 than to give one of them a value of 1 and the other 0.

In our example (Figure 2), the set of possible responses to any
item is $\{(3,0), (2,1), (1,2), (0,3)\}$, where $(3 - j, j)$ represents
seeing $3 - j$ responses of “1” and $j$ responses of “0”. We have
$I_1$ in the bucket (3,0), $I_3, I_4$ in the bucket (2,1), $I_2$ in the
bucket (1,2), and an empty bucket (0,3). We only consider mappings
$f$ where items in the same bucket are assigned the same value,
that is, $f(I_3) = f(I_4)$. This leaves $2^4$ mappings, corresponding to
assigning a value of 0/1 to each bucket. In general, given $m$ worker
responses per item, we have $m + 1$ buckets and $2^{m+1}$ mappings
that satisfy our bucketizing condition. Although for this example
$m + 1 = n$, typically we have $m \ll n$.
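
The bucketizing step amounts to grouping items by their response sets. A minimal Python sketch on the running example is shown here; the name `bucketize` and the dictionary layout (item name mapped to a pair of “1” and “0” counts) are assumptions made for illustration.

```python
from collections import defaultdict

def bucketize(responses):
    """Group items by response set (number of "1"s, number of "0"s)."""
    buckets = defaultdict(list)
    for item, response_set in responses.items():
        buckets[response_set].append(item)
    return dict(buckets)

m = 3
M = {"I1": (3, 0), "I2": (1, 2), "I3": (2, 1), "I4": (2, 1)}
buckets = bucketize(M)

# The m + 1 possible buckets, from all-"1" down to all-"0" responses.
all_buckets = [(m - j, j) for j in range(m + 1)]
for bucket in all_buckets:
    print(bucket, buckets.get(bucket, []))
```

Only non-empty buckets are stored; the empty bucket (0,3) appears in `all_buckets` but has no items.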

Dominance Ordering. Second, we observe that buckets have an
inherent ordering. If workers are better than random, that is, if their
false positive and false negative error rates are less than 0.5, we
intuitively expect items with more “1” responses to be more likely
to have true value 1 than items with fewer “1” responses. Ordering
buckets by the number of “1” responses, we have
$(m, 0) \to (m - 1, 1) \to \ldots \to (1, m - 1) \to (0, m)$, where bucket
$(m - j, j)$ contains all items that received $m - j$ “1” responses and
$j$ “0” responses. We eliminate all mappings that give a value of 0 to a
bucket with a larger number of “1” responses while assigning a
value of 1 to a bucket with fewer “1” responses. We formalize this
intuition as a dominance relation, or ordering on buckets,
$(m, 0) > (m - 1, 1) > \ldots > (1, m - 1) > (0, m)$, and only consider
mappings where dominating buckets receive a value not lower than
any of their dominated buckets.

Let us impose this dominance ordering on our example. For instance,
$I_1$ (three workers respond “1”) is more likely to have ground
truth value “1” than, or dominates, $I_3, I_4$ (two workers respond “1”),
which in turn dominate $I_2$. So, we do not consider mappings that
assign a value of “0” to $I_1$ and “1” to either of $I_3, I_4$. Figure 2
shows the dominance relation in the form of directed edges, with
the source node being the dominating bucket and the target node being
the dominated one. Combining this with our bucketizing idea,
we discard all mappings which assign a value of “0” to a dominating
bucket (say response set (3,0)) while assigning a value of “1”
to one of its dominated buckets (say response set (2,1)).
Dominance-Consistent Mappings. We consider the space of map¬ 
pings satisfying our above bucketizing and dominance constraints, 
and call them dominance-consistent mappings. We can prove that 
the maximum likelihood mapping from this small set of mappings 
is in fact a global maximum likelihood mapping across the space 
of all possible “reasonable” mappings: mappings corresponding to 
better than random worker behavior. 

To construct a mapping satisfying our above two constraints, we
choose a cut-point to cut the ordered set of response sets into two
partitions. The corresponding dominance-consistent mapping then
assigns value “1” to all (items in) buckets in the first (better)
partition, and value “0” to the rest. For instance, choosing the
cut-point between response sets (2,1) and (1,2) in Figure 2 results in
the corresponding dominance-consistent mapping, where $\{I_1, I_3, I_4\}$
are mapped to “1”, while $\{I_2\}$ is mapped to “0”. We have 5 different
cut-points, 0, 1, 2, 3, 4, each corresponding to one
dominance-consistent mapping. Cut-point 0 corresponds to the mapping where
all items are assigned a value of 0 and cut-point 4 corresponds to
the mapping where all items are assigned a value of 1. In particular,
the figure shows the dominance-consistent mapping corresponding
to the cut-point c = 2. In general, if we have m responses to each
item, we obtain m + 2 dominance-consistent mappings.


Definition 3.1 (Dominance-consistent mapping $f^c$).
For any cut-point $c \in \{0, 1, \ldots, m+1\}$, we define the corresponding
dominance-consistent mapping $f^c$ as

\[
f^c(I) =
\begin{cases}
1 & \text{if } M(I) \in \{(m, 0), \ldots, (m - c + 1, c - 1)\} \\
0 & \text{if } M(I) \in \{(m - c, c), \ldots, (0, m)\}
\end{cases}
\]
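Definition 3.1 translates directly into code. The sketch below is illustrative only; the name `cut_point_mapping` and the representation of a bucket as a `(ones, zeros)` tuple are assumptions. A bucket (m - j, j) receives value 1 exactly when j < c.

```python
def cut_point_mapping(buckets, c):
    """Value assigned by f^c to each bucket (ones, zeros): 1 iff zeros < c."""
    return {bucket: 1 if bucket[1] < c else 0 for bucket in buckets}

m = 3
all_buckets = [(m - j, j) for j in range(m + 1)]

# One mapping per cut-point c in {0, ..., m + 1}, i.e. m + 2 mappings.
mappings = [cut_point_mapping(all_buckets, c) for c in range(m + 2)]
```

For m = 3 this yields the five mappings discussed above, from "all items 0" (c = 0) to "all items 1" (c = 4).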


Our algorithm enumerates all dominance-consistent mappings,
computes their likelihoods, and returns the most likely one among
them. As there are m + 2 mappings, each of whose likelihoods
can be evaluated in $O(m|\mathcal{I}|)$ (see Section 3.4), the running time of
our algorithm is $O(m^2|\mathcal{I}|)$.


Algorithm 1 Cut-point Algorithm

I := Input Item-set
M := Input Response Set
f := {}  {Different dominance-consistent mappings}
e_0 := {}  {e_0 rates corresponding to cut-functions}
e_1 := {}  {e_1 rates corresponding to cut-functions}
Likelihood := {}  {Likelihoods corresponding to cut-functions}
{Enumerating across cut-points:}
for c in {0, 1, ..., m + 1} do
    f[c] := f^c  {mapping from Definition 3.1}
    e_0[c], e_1[c] := Params(f[c], M)
    Likelihood[c] := Pr(M | f[c], e_0[c], e_1[c])
end for
c* := argmax_c Likelihood[c]
RETURN (f[c*], e_0[c*], e_1[c*])
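A runnable sketch of the cut-point algorithm follows. It is a simplified illustration under stated assumptions, not the authors' code: the helper names `xlogy`, `fit` and `cut_point_algorithm` are hypothetical, the input is a made-up set of 12 items, log-likelihoods are used instead of raw likelihoods for numerical stability, and an error rate is set to 0 when no item is mapped to the corresponding value.

```python
import math

def xlogy(x, y):
    """x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def fit(counts, f):
    """Return (e0, e1, log-likelihood) for mapping f.

    counts[item] = (ones, zeros); f[item] is 0 or 1.
    """
    ones1 = sum(o for it, (o, z) in counts.items() if f[it] == 1)
    zeros1 = sum(z for it, (o, z) in counts.items() if f[it] == 1)
    ones0 = sum(o for it, (o, z) in counts.items() if f[it] == 0)
    zeros0 = sum(z for it, (o, z) in counts.items() if f[it] == 0)
    # Params(f, M): empirical error rates; 0.0 if a class is empty (assumption).
    e1 = zeros1 / (ones1 + zeros1) if ones1 + zeros1 else 0.0
    e0 = ones0 / (ones0 + zeros0) if ones0 + zeros0 else 0.0
    loglik = (xlogy(ones1, 1 - e1) + xlogy(zeros1, e1)
              + xlogy(ones0, e0) + xlogy(zeros0, 1 - e0))
    return e0, e1, loglik

def cut_point_algorithm(counts, m):
    """Try every cut-point c in {0, ..., m + 1}; keep the most likely mapping."""
    best = None
    for c in range(m + 2):
        f = {it: 1 if z < c else 0 for it, (o, z) in counts.items()}
        e0, e1, loglik = fit(counts, f)
        if best is None or loglik > best[3]:
            best = (c, f, (e0, e1), loglik)
    return best

# Hypothetical input: 12 items with m = 3 responses each.
counts = {}
for k, bucket in enumerate([(3, 0)] * 5 + [(2, 1)] * 2 + [(1, 2)] + [(0, 3)] * 4):
    counts[f"I{k + 1}"] = bucket

c_star, f_star, (e0_star, e1_star), loglik_star = cut_point_algorithm(counts, m=3)
```

On this toy input the best cut falls between buckets (2,1) and (1,2), so the seven items with a majority of “1” responses are mapped to 1.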


In the next section we prove that in spite of only searching the 
much smaller space of dominance-consistent mappings, our algo¬ 
rithm finds a global maximum likelihood solution. 

3.3 Proof of Correctness 

Reasonable Mappings. A reasonable mapping is one which corresponds
to better than random worker behavior. Consider a mapping
f corresponding to a false positive rate of $e_0 > 0.5$ (as given
by Params(f, M)). This mapping is unreasonable because workers
perform worse than random for items with value “0”. Given $\mathcal{I}$, M,
let $f : \mathcal{I} \to \{0, 1\}$, with $e_0, e_1$ = Params(f, M), be a mapping
such that $e_0 < 0.5$ and $e_1 < 0.5$. Then f is a reasonable mapping.
It is easy to show that all dominance-consistent mappings are
reasonable mappings.

Now, we present our main result on the optimality of our algo¬ 
rithm. We show that in spite of only considering the local space of 
dominance-consistent mappings, we are able to find a global max¬ 
imum likelihood mapping. 

Theorem 3.1 (Maximum Likelihood). We let M be the
given response set on the input item-set $\mathcal{I}$. Let $F$ be the set of all
reasonable mappings and $F^{dom}$ be the set of all dominance-consistent
mappings. Then,

\[
\max_{f^* \in F^{dom}} \Pr(M|f^*) = \max_{f \in F} \Pr(M|f)
\]


PROOF 3.1. We divide our proof into steps. The first step describes
the overall flow of the proof and provides the high level
structure for the remaining steps.

Step 1: Suppose f is not a dominance-consistent mapping. Then,
either it does not satisfy the bucketizing constraint, or it does not
satisfy the dominance constraint. We claim that if f is reasonable,
we can always construct a dominance-consistent mapping $f^*$ such
that $\Pr(M|f^*) \ge \Pr(M|f)$. Since every dominance-consistent mapping
is itself reasonable, it then follows that
$\max_{f^* \in F^{dom}} \Pr(M|f^*) = \max_{f \in F} \Pr(M|f)$. We show the
construction of such an $f^*$ for every reasonable mapping f in the
following steps.

Step 2: (Dominance Inconsistency). Suppose f does not satisfy
the dominance constraint. Then, there exists at least one pair of
items $I_1, I_2$ such that $M(I_1) > M(I_2)$ and $f(I_1) = 0$, $f(I_2) = 1$.

Define mapping $f'$ as follows:

\[
f'(I) =
\begin{cases}
f(I_2) & \text{for } I = I_1 \\
f(I_1) & \text{for } I = I_2 \\
f(I) & \forall I \in \mathcal{I} \setminus \{I_1, I_2\}
\end{cases}
\]

Mapping $f'$ is identical to f everywhere except at $I_1$ and $I_2$,
where it swaps their respective values. We show that $\Pr(M|f') \ge \Pr(M|f)$.
Let $M(I_1) = (m - i, i)$ and $M(I_2) = (m - j, j)$,
where $i < j$ and $f(I_1) = 0$, $f(I_2) = 1$. Let $n_{k1}$ (respectively
$n_{k0}$) denote the number of items, excluding $I_1, I_2$, with response
set $(m - k, k)$ in M such that $f(I) = 1$ (respectively 0). We
abuse notation slightly to use $n_{k1}, n_{k0}$ to also denote the sets of
items, excluding $I_1, I_2$, with response set $(m - k, k)$ and a value
of 1, 0 respectively under f, wherever the meaning is clear from
the context. Given f, M, we can calculate the response probability
matrix as shown in Section 3.4. Let $p_{00} = 1 - e_0$ be the probability
that workers respond 0 to an item with mapping 0 under f.
Similarly, let $p_{11} = 1 - e_1$ be the probability that workers respond
1 to an item with mapping 1 under f. Given $f, e_0, e_1$, we have
$\Pr(M|f) = \prod_{I \in \mathcal{I}} \Pr(M(I)|f, e_0, e_1)$. We can write

\[
p_{11} = \frac{(m - j) + \sum_k (m - k)\, n_{k1}}{m + \sum_k m\, n_{k1}}
\]

and

\[
p_{00} = \frac{i + \sum_k k\, n_{k0}}{m + \sum_k m\, n_{k0}}
\]

We split the likelihood of f into two independent parts. Let the
probability contributed by the items with value 1 under f be
$\Pr_1(M|f)$, and that contributed by the items with value 0 be
$\Pr_0(M|f)$, such that $\Pr(M|f) = \Pr_1(M|f)\,\Pr_0(M|f)$.
We claim that for reasonable mappings,
$\Pr_1(M|f') \ge \Pr_1(M|f) \wedge \Pr_0(M|f') \ge \Pr_0(M|f)$.
Then, we have $\Pr(M|f') \ge \Pr(M|f)$.

We prove below that $\Pr_1(M|f') \ge \Pr_1(M|f)$. The proof for $\Pr_0$
can be derived in a similar fashion. We have

\[
\Pr_1(M|f) = p_{11}^{(m - j) + \sum_k (m - k) n_{k1}} \, (1 - p_{11})^{j + \sum_k k\, n_{k1}}
\]

We can similarly calculate $\Pr_1(M|f')$, where
$p'_{11} = \frac{(m - i) + \sum_k (m - k) n_{k1}}{m + \sum_k m\, n_{k1}}$.

Now, let $a = (m - i) + \sum_k (m - k) n_{k1}$, $b = (m - j) + \sum_k (m - k) n_{k1}$,
and $c = m + \sum_k m\, n_{k1}$. We then have

\[
\frac{\Pr_1(M|f')}{\Pr_1(M|f)} = \frac{a^a (c - a)^{c - a}}{b^b (c - b)^{c - b}}
\]

Note that since $i < j$, we have $a > b$. It can then be shown
that for $a + b > c$, $\frac{a^a (c - a)^{c - a}}{b^b (c - b)^{c - b}} > 1$.
Furthermore, for reasonable mappings, we have
$p_{11} > \frac{1}{2} \Rightarrow a, b > \frac{c}{2} \Rightarrow a + b > c$.
Therefore, $\frac{\Pr_1(M|f')}{\Pr_1(M|f)} > 1 \Rightarrow \Pr_1(M|f') > \Pr_1(M|f)$.
Similarly, we can show $\Pr_0(M|f') \ge \Pr_0(M|f)$, and therefore,
$\Pr(M|f') \ge \Pr(M|f)$.
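The inequality used in Step 2 can be checked with a short calculation. The auxiliary function $g$ below is introduced only for this illustration and is not notation from the paper; the convention $0 \ln 0 = 0$ is assumed for the boundary case $a = c$.

```latex
\[
g(x) = x \ln x + (c - x)\ln(c - x), \qquad 0 < x < c,
\]
\[
g'(x) = \ln x - \ln(c - x) = \ln\frac{x}{c - x} > 0
\quad \text{for } x > \tfrac{c}{2}.
\]
Hence $g$ is strictly increasing on $(c/2, c)$, and for $a > b > c/2$,
\[
\frac{a^a (c - a)^{c - a}}{b^b (c - b)^{c - b}} = e^{\,g(a) - g(b)} > 1 .
\]
```

The condition $b > c/2$ is exactly $p_{11} > 1/2$, which is where the reasonableness of f enters the argument.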

Step 3: (Bucketizing Inconsistency). Suppose f does not satisfy
the bucketizing constraint. Then, we have at least one pair of items
$I_1, I_2$ such that $M(I_1) = M(I_2)$ and $f(I_1) \ne f(I_2)$. Consider
the two mappings $f_1$ and $f_2$ defined as follows:

\[
f_1(I) =
\begin{cases}
f(I_2) & \text{for } I = I_1 \\
f(I) & \forall I \in \mathcal{I} \setminus \{I_1\}
\end{cases}
\qquad
f_2(I) =
\begin{cases}
f(I_1) & \text{for } I = I_2 \\
f(I) & \forall I \in \mathcal{I} \setminus \{I_2\}
\end{cases}
\]

The mappings $f_1$ and $f_2$ are identical to f everywhere except
at $I_1$ and $I_2$, where $f_1(I_1) = f_1(I_2) = f(I_2)$ and
$f_2(I_1) = f_2(I_2) = f(I_1)$. We can show (using a similar calculation as in
Step 2) that $\max(\Pr(M|f_1), \Pr(M|f_2)) \ge \Pr(M|f)$. Let
$f' = \arg\max_{f_1, f_2} (\Pr(M|f_1), \Pr(M|f_2))$.

Step 4: (Reducing Inconsistencies). Suppose f is not a
dominance-consistent mapping. We have shown that by reducing either a
dominance inconsistency (Step 2), or a bucketizing inconsistency (Step
3), we can construct a new mapping $f'$ with likelihood greater
than or equal to that of f. Now, if $f'$ is a dominance-consistent
mapping, set $f^* = f'$ and we are done. If not, look at an inconsistency
in $f'$ and apply Step 2 or 3 to it. With each iteration, we
are reducing at least one inconsistency without decreasing likelihood.
We repeat this process iteratively, and since there are only a finite
number of inconsistencies in f to begin with, we are guaranteed to
end up with a desired dominance-consistent mapping $f^*$ satisfying
$\Pr(M|f^*) \ge \Pr(M|f)$. This completes our proof. □


3.4 Calculating error rates from mappings 

In this section, we formalize the correspondence between mappings
and worker error rates that we introduced in Section 3.1.
Given a response set M and a mapping f, we calculate the
corresponding worker error rates $e_0(f, M), e_1(f, M)$ = Params(f, M)
as follows:

1. Let $I_j \subseteq \mathcal{I}$ be such that $f(I) = 0$, $M(I) = (m - j, j)$ $\forall I \in I_j$. Then,

\[
e_0 = \frac{\sum_{j=0}^{m} (m - j)\,|I_j|}{m \sum_{j=0}^{m} |I_j|}
\]

2. Let $I_j \subseteq \mathcal{I}$ be such that $f(I) = 1$, $M(I) = (m - j, j)$ $\forall I \in I_j$. Then,

\[
e_1 = \frac{\sum_{j=0}^{m} j\,|I_j|}{m \sum_{j=0}^{m} |I_j|}
\]

Intuitively, $e_0(f, M)$ (respectively $e_1(f, M)$) is just the fraction
of times a worker responds with a value of 1 (respectively 0) for
an item whose true value is 0 (respectively 1), under response set
M assuming that f is the ground truth mapping. We show that for
each mapping f, this intuitive set of false positive and false negative
error rates, $(e_0, e_1)$, maximizes $\Pr(M|f, e_0, e_1)$. We express this
idea formally below.

Lemma 3.1 (Params(f, M)). Given response set M, let
$\Pr(e_0, e_1|f, M)$ be the probability that the underlying worker false
positive and negative rates are $e_0$ and $e_1$ respectively, conditioned
on mapping f being the true mapping. Then,

\[
\forall f,\quad \text{Params}(f, M) = \arg\max_{e_0, e_1} \Pr(e_0, e_1|f, M)
\]

PROOF 3.2. Let M, f be given. By Bayes theorem,
$\Pr(e_0, e_1|f, M) = k \Pr(M|e_0, e_1, f)$ for some constant k. Therefore,
$\arg\max_{e_0, e_1} \Pr(e_0, e_1|f, M) = \arg\max_{e_0, e_1} \Pr(M|e_0, e_1, f)$. Now,
$\Pr(M|e_0, e_1, f) = \prod_{I \in \mathcal{I}} \Pr(M(I)|f, e_0, e_1)$.

Let $I_{j,0} \subseteq \mathcal{I}$ be such that $f(I) = 0$, $M(I) = (m - j, j)$ $\forall I \in I_{j,0}$, and
$I_{j,1} \subseteq \mathcal{I}$ be such that $f(I) = 1$, $M(I) = (m - j, j)$ $\forall I \in I_{j,1}$. We have

\[
\Pr(M(I)|f, e_0, e_1) =
\begin{cases}
(1 - e_1)^{m - j}\, e_1^{j} & \forall I \in I_{j,1} \\
e_0^{m - j}\, (1 - e_0)^{j} & \forall I \in I_{j,0}
\end{cases}
\]

Therefore,

\[
\Pr(M|f, e_0, e_1) = \prod_{j} \left[(1 - e_1)^{m - j} e_1^{j}\right]^{|I_{j,1}|} \left[e_0^{m - j} (1 - e_0)^{j}\right]^{|I_{j,0}|}
\]

For ease of notation, let $a_1 = \sum_{j=0}^{m} j\,|I_{j,1}|$, $b_1 = \sum_{j=0}^{m} (m - j)\,|I_{j,1}|$,
$a_0 = \sum_{j=0}^{m} j\,|I_{j,0}|$, and $b_0 = \sum_{j=0}^{m} (m - j)\,|I_{j,0}|$. Then, we
have $\Pr(M|f, e_0, e_1) = (1 - e_1)^{b_1} e_1^{a_1} (1 - e_0)^{a_0} e_0^{b_0}$.

To maximize $\Pr(M|e_0, e_1, f)$ given M, f, we compute its partial
derivatives with respect to $e_0$ and $e_1$ and set them to 0. We
have $\frac{\partial \Pr(M|e_0, e_1, f)}{\partial e_1} = 0 \Rightarrow a_1 (1 - e_1) - b_1 e_1 = 0$
(simplifying common terms). Therefore,
$e_1 = \frac{a_1}{a_1 + b_1} = \frac{\sum_{j=0}^{m} j\,|I_{j,1}|}{m \sum_{j=0}^{m} |I_{j,1}|}$.
It is easy to verify that the second derivative
$\frac{\partial^2 \Pr(M|e_0, e_1, f)}{\partial e_1^2}$ is negative
for this value of $e_1$. We recall that this is the value of $e_1$ under
Params(f, M). Similarly, we can also show that
$\frac{\partial \Pr(M|e_0, e_1, f)}{\partial e_0} = 0$ forces
$e_0 = \frac{b_0}{a_0 + b_0} = \frac{\sum_{j=0}^{m} (m - j)\,|I_{j,0}|}{m \sum_{j=0}^{m} |I_{j,0}|}$,
which is the value given by Params(f, M).
Therefore, Params(f, M) $= \arg\max_{e_0, e_1} \Pr(e_0, e_1|f, M)$. □

Next, we show that instead of simultaneously trying to find the most
likely mapping and false positive and false negative error rates, it is
sufficient to just find the most likely mapping while assuming that
the error rates corresponding to any chosen mapping, f, are always
given by Params(f, M). We formalize this intuition in Lemma 3.2
below.

Lemma 3.2 (Likelihood of a Mapping). Let $f \in F$ be
any mapping and M be the given response set on $\mathcal{I}$. We have,

\[
\max_{f, e_0, e_1} \Pr(M|f, e_0, e_1) = \max_{f} \Pr(M|f, \text{Params}(f, M))
\]

PROOF 3.3. The proof for this statement follows easily from
Lemma 3.1. Let $f^*, e_0^*, e_1^* = \arg\max_{f, e_0, e_1} \Pr(M|f, e_0, e_1)$. Now, let
$e_0', e_1'$ = Params($f^*$, M). From Lemma 3.1, we have
$e_0', e_1' = \arg\max_{e_0, e_1} \Pr(M|f^*, e_0, e_1)$. So we have,
$\Pr(M|f^*, \text{Params}(f^*, M)) \ge \Pr(M|f^*, e_0^*, e_1^*)$.
Additionally, $\max_{f} \Pr(M|f, \text{Params}(f, M)) \ge \Pr(M|f^*, \text{Params}(f^*, M))$.
Therefore,

\[
\max_{f} \Pr(M|f, \text{Params}(f, M)) \ge \max_{f, e_0, e_1} \Pr(M|f, e_0, e_1)
\]

Conversely, by definition,

\[
\max_{f, e_0, e_1} \Pr(M|f, e_0, e_1) \ge \max_{f} \Pr(M|f, \text{Params}(f, M))
\]

Therefore, combining the above inequalities, we have
$\max_{f, e_0, e_1} \Pr(M|f, e_0, e_1) = \max_{f} \Pr(M|f, \text{Params}(f, M))$. □
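The closed-form error rates of Lemma 3.1 can be sanity-checked numerically. The sketch below is an illustration only: the aggregate counts are made up, the function name `log_likelihood` is hypothetical, and a coarse grid search stands in for the calculus argument. Because the likelihood factorizes into a term in $e_0$ and a term in $e_1$, each rate can be searched separately.

```python
import math

def log_likelihood(a1, b1, a0, b0, e0, e1):
    """log of (1-e1)^b1 * e1^a1 * (1-e0)^a0 * e0^b0, for rates strictly in (0, 1)."""
    return (b1 * math.log(1 - e1) + a1 * math.log(e1)
            + a0 * math.log(1 - e0) + b0 * math.log(e0))

# Hypothetical aggregate response counts (not taken from the paper).
a1, b1 = 2, 19   # "0" and "1" responses on items mapped to 1
a0, b0 = 14, 1   # "0" and "1" responses on items mapped to 0

# Closed-form maximizers, as given by Params(f, M).
e1_closed = a1 / (a1 + b1)
e0_closed = b0 / (a0 + b0)

# Grid search over (0, 1) in steps of 0.001; the other rate is held fixed.
grid = [k / 1000 for k in range(1, 1000)]
e1_grid = max(grid, key=lambda e: log_likelihood(a1, b1, a0, b0, 0.5, e))
e0_grid = max(grid, key=lambda e: log_likelihood(a1, b1, a0, b0, e, 0.5))
```

The grid maximizers agree with the closed-form values up to the grid resolution.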

3.5 Experiments 

The goal of our experiments is two-fold. First, we wish to verify
that our algorithm does indeed find higher likelihood mappings.
Second, we wish to compare our algorithm against standard baselines
for such problems, like the EM algorithm, for different metrics
of interest. While our algorithm optimizes for likelihood of mappings,
we are also interested in other metrics that measure the quality
of our predicted item assignments and worker response probability
matrix. For instance, we test what fraction of item values are
predicted correctly by different algorithms to measure the quality
of item value prediction. We also compare the similarity of predicted
worker response probability matrices with the actual underlying
matrices using distance measures such as Earth-Movers Distance
and Jensen-Shannon Distance. We run experiments on both simulated
as well as real data and discuss our findings below.

3.5.1 Simulated Data 

Dataset generation. For our synthetic experiments, we assign
ground truth 0-1 values to n items randomly based on a fixed
selectivity. Here, a selectivity of s means that each item has a probability
of s of being assigned true value 1 and 1 - s of being assigned true
value 0. This represents our set of items $\mathcal{I}$ and their ground truth
mapping, T. We generate a random “true” or underlying worker
response probability matrix, with the only constraint being that workers
are better than random (false positive and false negative rates
< 0.5). We simulate the process of workers responding to items by
drawing their response from their true response probability matrix,
$p_{true}$. This generates one instance of the response set, M. The
different algorithms being compared now take $\mathcal{I}$, M as input and return
a mapping $f : \mathcal{I} \to \{0, 1\}$, and a worker response matrix p.
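The generation procedure can be sketched as follows. This is a hedged illustration, not the authors' generator: the function name `generate_dataset` is hypothetical, and drawing the two error rates uniformly from [0, 0.5) is an assumption, since the text only requires workers to be better than random.

```python
import random

def generate_dataset(n, m, s, rng):
    """Simulate ground truth, worker error rates and m responses per item."""
    # Ground truth: each item is 1 with probability s (the selectivity).
    truth = [1 if rng.random() < s else 0 for _ in range(n)]
    # Better-than-random workers: both error rates in [0, 0.5) (assumed uniform).
    e0 = 0.5 * rng.random()  # false positive rate
    e1 = 0.5 * rng.random()  # false negative rate
    responses = []
    for t in truth:
        p_one = 1 - e1 if t == 1 else e0  # probability of answering "1"
        responses.append([1 if rng.random() < p_one else 0 for _ in range(m)])
    return truth, (e0, e1), responses

truth, (e0, e1), responses = generate_dataset(n=1000, m=3, s=0.7, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps each simulated trial reproducible.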

Parameters varied. We experiment with different choices over
both input parameters and comparison metrics over the output. For
input parameters, we vary the number of items n, the selectivity
(which controls the ground truth) s, and the number of worker
responses per item, m. While we try out several different combinations,
we only show a small representative set of results below. In
particular, we observe that changing the value of n does not
significantly affect our results. We also note that the selectivity can be
broadly classified into two categories: evenly distributed (s = 0.5),
and skewed (s > 0.5 or s < 0.5). In each of the following plots
we use a set of n = 1000 items and show results for either s = 0.5
or s = 0.7. We show only one plot if the result is similar to and
representative of other input parameters. More experimental results
can be found in the appendix.

Metrics. We test the output of different algorithms on a few different
metrics: we compare the likelihoods of their output mappings,
we compare the fraction of items whose values are predicted
incorrectly, and we compare the quality of the predicted worker response
probability matrix. For this last metric, we use different distance
functions to measure how close the predicted worker matrix is to
the underlying one used to generate the data. In this paper we report
our distance measures using an Earth-Movers Distance (EMD)
based score. For a full description of our EMD-based score and
other distance metrics used, we refer to the appendix.

Algorithms. We compare our algorithm, denoted OPT, against
the standard Expectation-Maximization (EM) algorithm that is also
solving the same underlying maximum likelihood problem. The
EM algorithm starts with an arbitrary initial guess for the worker
response matrix, $p_1$, and computes the most likely mapping $f_1$
corresponding to it. The algorithm then in turn computes the most likely
worker response matrix $p_2$ corresponding to $f_1$ (which is not
necessarily $p_1$) and repeats this process iteratively until convergence.
We experiment with different initializations for the EM algorithm,
represented by EM(1), EM(2), EM(3). EM(1) represents the starting point
with false positive and negative rates $e_0, e_1 = 0.25$ (workers are
better than random), EM(2) represents the starting point of $e_0, e_1 = 0.5$
(workers are random), and EM(3) represents the starting point
of $e_0, e_1 = 0.75$ (workers are worse than random). EM(*) is
the consolidated algorithm which runs each of the three EM instances
and picks the maximum likelihood solution across them for
the given $\mathcal{I}$, M.
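The alternating procedure described above can be sketched as a hard-assignment EM loop. This is a simplified illustration under stated assumptions, not the baseline implementation used in the experiments: the names `hard_em`, `estimate_rates` and `item_loglik` are hypothetical, ties are broken in favour of value 1, rates are clamped away from 0 and 1 to avoid `log(0)`, and the input is a made-up set of 12 items with m = 3.

```python
import math

EPS = 1e-9  # clamp to avoid log(0); an implementation choice for this sketch

def clamp(x):
    return min(max(x, EPS), 1 - EPS)

def estimate_rates(counts, f):
    """Empirical (e0, e1) given mapping f; 0.0 if a class is empty (assumption)."""
    ones0 = sum(o for (o, z), v in zip(counts, f) if v == 0)
    all0 = sum(o + z for (o, z), v in zip(counts, f) if v == 0)
    zeros1 = sum(z for (o, z), v in zip(counts, f) if v == 1)
    all1 = sum(o + z for (o, z), v in zip(counts, f) if v == 1)
    return (ones0 / all0 if all0 else 0.0, zeros1 / all1 if all1 else 0.0)

def item_loglik(o, z, value, e0, e1):
    """Log-likelihood of o ones and z zeros for an item with the given value."""
    e0, e1 = clamp(e0), clamp(e1)
    if value == 1:
        return o * math.log(1 - e1) + z * math.log(e1)
    return o * math.log(e0) + z * math.log(1 - e0)

def hard_em(counts, e0, e1, max_iter=100):
    f = None
    for _ in range(max_iter):
        # Most likely value for each item under the current rates.
        new_f = [1 if item_loglik(o, z, 1, e0, e1) >= item_loglik(o, z, 0, e0, e1)
                 else 0 for o, z in counts]
        if new_f == f:
            break
        f = new_f
        # Most likely rates under the current mapping.
        e0, e1 = estimate_rates(counts, f)
    loglik = sum(item_loglik(o, z, v, e0, e1) for (o, z), v in zip(counts, f))
    return f, (e0, e1), loglik

# Hypothetical input: (ones, zeros) per item.
counts = [(3, 0)] * 5 + [(2, 1)] * 2 + [(1, 2)] + [(0, 3)] * 4
results = {name: hard_em(counts, e, e)
           for name, e in [("EM(1)", 0.25), ("EM(2)", 0.5), ("EM(3)", 0.75)]}
best = max(results.values(), key=lambda r: r[2])  # EM(*)
```

On this toy input, the runs started at 0.25 and 0.75 converge to mirror-image mappings with the same likelihood, which illustrates how the outcome depends on the initialization.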

Setup. We vary the number of worker responses per item along the 
x-axis and plot different objective metrics (likelihood, fraction of 
incorrect item value predictions, accuracy of predicted worker re¬ 
sponse matrix) along the y-axis. Each data point represents the 
value of the objective metric averaged across 1000 random tri¬ 
als. That is, for each fixed value of m, we generate 1000 differ¬ 
ent worker response matrices, and correspondingly 1000 different 
response sets M. We run each of the algorithms over all these 
datasets, measure the value of their objective function and average 
across all problem instances to generate one point on the plot. 

Likelihood. Figure 3(a) shows the likelihoods of mappings returned
by our algorithm OPT and the different instances of the
EM algorithm. In this experiment, we use s = 0.5, that is, items’
true values are roughly evenly distributed over {0, 1}. Note that the
y-axis plots the likelihood on a log scale, and that a higher value is
more desirable. We observe that our algorithm does indeed return
higher likelihood mappings, with the marginal improvement going
down as m increases. However, in practice, it is unlikely that we
will ever use m greater than 5 (5 answers per item). While our
gains for the simple filtering setting are small, as we will see in
Section 4.3, the gains are significantly higher for the case of rating,
where multiple error rate parameters are being simultaneously
estimated. (For the filtering case, only false positive and false negative
error rates are being estimated.)

Fraction incorrect. In Figure [3(b)] (s = 0.7), we plot the fraction 
of item values each of the algorithms predicts incorrectly and av¬ 
erage this measure over the 1000 random instances. A lower score 
means a more accurate prediction. We observe that our algorithm 
estimates the true values of items with a higher accuracy than the 
EM instances. 

EMD score. To compare the qualities of our predicted worker false
positive and false negative error rates, we compute and plot
EMD-based scores in Figure 3(c) (s = 0.5) and Figure 4(a) (s = 0.7).
Note that since EMD is a distance metric, a lower score means that
the predicted worker response matrices are closer to the actual ones;
so, algorithms that are lower on this plot do better. We observe that
the worker response probability matrix predicted by our algorithm
is closer to the actual probability matrix used to generate the data
than all the EM instances. While EM(1) in particular does well
for this experiment, we observe that EM(2) and EM(3) get stuck
in bad local maxima, making EM(*) sensitive to the initialization.

Although averaged across a large number of instances, EM( 1) 
and EM (*) do perform well, our experiments show that optimiz¬ 
ing for likelihood does not adversely affect other potential param¬ 
eters of interest. For all metrics considered, OPT performs better, 
in addition to giving us a global maximum likelihood guarantee. 
(As we will see in the rating section, our results are even better 
there since multiple parameters are being estimated.) We experi¬ 
ment with a number of different parameter settings and comparison 
metrics and present more extensive results in the appendix. 

3.5.2 Real Data 

Figure 3: Synthetic Data Experiments: (a) Likelihood, s = 0.5 (b) Fraction Incorrect, s = 0.7 (c) EMD Score, s = 0.5

Dataset. In this experiment, we use an image comparison dataset [6]
where 19 workers are asked to each evaluate 48 tasks. Each task
consists of displaying a pair of sporting images to a worker and
asking them to evaluate if both images show the same sportsperson.
We have the ground truth yes/no answers for each pair, but do
not know the worker error rates. Note that for this real dataset, our
assumptions that all workers have the same error rates and answer
questions independently may not necessarily hold true. We show
that in spite of the assumptions made by our algorithm, it estimates
the values of items with a high degree of accuracy even on
this real dataset.

To evaluate the performance of our algorithm and the EM-based
baseline, we compare the estimates for the item values against
the given ground truth. Note that since we do not have a ground
truth for worker error rates under this setting, we cannot evaluate
the algorithms for that aspect; we do, however, study the likelihood of
the final solutions from different algorithms.


Setup. We vary the number of workers used from 1 to 19 and plot
the performance of algorithms OPT, EM(1), EM(2), EM(3),
EM(*) similar to Section 3.5.1. We plot the number of worker
responses used along the x-axis. For instance, a value of m = 4
indicates that for each item, four random worker responses are chosen.
The four workers answering one item may be different from
those answering another item. This random sample of the response
set is given as input to the different algorithms. Similar to our
simulations, we average our results across 100 different trials for each
data point in our subsequent plots. For each fixed value of m, one
trial corresponds to choosing a set of m worker responses to each
item randomly. We run 100 trials for each m, and correspondingly
generate 100 different response sets M. We run our OPT and
EM algorithms over all these datasets, measure the value of the different
objective functions and average across all problem instances to
generate one point on a plot.


Likelihood. Figure 4(b) plots the likelihoods of the final solution
for different algorithms. We observe that except for EM(2), all
algorithms have a high likelihood. This can be explained as follows:
EM(2) starts with an initialization of $e_0$ and $e_1$ rates
around 0.5 and converges to a final response probability matrix in
that neighborhood. Final error rates of around 0.5 (random) will
naturally have low likelihood when there is a high amount of agreement
between workers. EM(1) and EM(3), on the other hand,
start with, and converge to, near opposite extremes, with EM(1)
predicting $e_0/e_1$ rates ≈ 0 and EM(3) predicting error rates ≈ 1.
Both of these, however, result in a high likelihood of observing the
given response, with EM(1) predicting that the worker is always
correct, and EM(3) predicting that the worker is always incorrect,
i.e., adversarial. Even though EM(1) and EM(3) often converge
to completely opposite predictions of item values because of their
initializations, their solutions still have similar likelihoods
corresponding to the intuitive extremes of perfect and adversarial worker
behavior. This behavior thus demonstrates the strong dependence
of EM-based approaches on the initialization parameters.


Fraction Incorrect. Figure 4(c) plots the fraction of items predicted
incorrectly along the y-axis for OPT and the EM algorithms.
Corresponding to their opposite error-rate estimates, the item-value
predictions of EM(1) and EM(3) are opposite, as can be seen in
Figure 4(c).

We observe that both EM(1) and our algorithm OPT do fairly
well on this dataset even when very few worker responses
are used. However, EM(*), which one may expect would
typically do better than the individual EM initializations, sometimes
does poorly compared to OPT by picking solutions of high
likelihood that are nevertheless not very good. Note that here we
assume that worker identities are unknown and arbitrary workers
could be answering different tasks. Our goal is to characterize the
behavior of the worker population as a whole. For larger datasets,
we expect the effects of population smoothing to be greater and
our assumptions on worker homogeneity to be closer to the truth.
So, even though our algorithm provides theoretical global guarantees
under somewhat strong assumptions, it also performs well for
settings where our assumptions may not necessarily be true.


4. RATING PROBLEM 

In this section, we extend our techniques from filtering to the 
problem of rating items. Even though the main change resides in 
the possible values of items ({0,1} for filtering and {1,..., R} for 
rating), this small change adds significant complexity to our dom¬ 
inance idea. We show how our notions of bucketizing and domi¬ 
nance generalize from the filtering case. 

4.1 Formalization 

Recall from Section 2 that, for the rating problem, workers are
shown an item and asked to provide a score from 1 to R, with R being
the best (highest) and 1 being the worst (lowest) score possible.
Each item from the set $\mathcal{I}$ receives m worker responses and all the
responses are recorded in M. We write $M(I) = (v_R, v_{R-1}, \ldots, v_1)$
if item $I \in \mathcal{I}$ receives $v_i$ responses of “i”, $1 \le i \le R$. Recall that
$\sum_{i=1}^{R} v_i = m$. Mappings are functions $f : \mathcal{I} \to \{1, 2, \ldots, R\}$ and
workers are described by the response probability matrix p, where
$p(i, j)$ ($i, j \in \{1, 2, \ldots, R\}$) denotes the probability that a worker
will give an item with true value j a score of i. Our problem is
defined as that of finding $f^*, p^* = \arg\max_{f, p} \Pr(M|f, p)$ given M.

As in the case of filtering, we use the relation between p and
f through M to define the likelihood of a mapping. We observe
that for maximum likelihood solutions given M, fixing a mapping
f automatically fixes an optimal p = Params(f, M). Thus, as
before, we focus our attention on the mappings, implicitly finding
the maximum likelihood p as well. The following lemma and its
proof sketch capture this idea.

Figure 4: Synthetic Data Experiments: (a) EMD Score, s = 0.7. Real Data Experiments: (b) Likelihood (c) Fraction Incorrect


Lemma 4.1 (Likelihood of a mapping). We have

\[
\max_{f, p} \Pr(M|f, p) = \max_{f} \Pr(M|f, \text{Params}(f, M))
\]

where Params(f, M) $= \arg\max_{p} \Pr(p|f, M)$.


PROOF 4.1. Given mapping f and evidence M, we can calculate
the worker response probability matrix p = Params(f, M) as
follows. Let the i-th dimension of the response set of any item I be
$M_i(I)$. That is, if $M(I) = (v_R, \ldots, v_1)$, then $M_i(I) = v_i$. Let
$I_j \subseteq \mathcal{I}$ be such that $f(I) = j$ $\forall I \in I_j$. Then,
$p(i, j) = \frac{\sum_{I \in I_j} M_i(I)}{m\,|I_j|}$ $\forall i, j$.
Intuitively, Params(f, M)(i, j) is just the fraction of times a worker
responded i to an item that is mapped by f to a value of j. Similar
to Lemma 3.1, we can show that Params(f, M) $= \arg\max_{p} \Pr(p|f, M)$.
Consequently, it follows that

\[
\max_{f, p} \Pr(M|f, p) = \max_{f} \Pr(M|f, \text{Params}(f, M)) \qquad \Box
\]


Denoting the likelihood of a mapping, $\Pr(M|f, \text{Params}(f, M))$,
as $\Pr(M|f)$, our maximum likelihood rating problem is now equivalent
to that of finding the most likely mapping. Thus, we wish to
solve for $\arg\max_{f} \Pr(M|f)$.
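For the rating case, Params(f, M) is the matrix of empirical response frequencies. The sketch below illustrates this under assumptions: the function name `rating_params`, the 1-based list-of-lists layout (row and column 0 unused), and the toy responses are all hypothetical, and columns for ratings that no item is mapped to are simply left at zero.

```python
def rating_params(responses, f, R):
    """p[i][j]: fraction of responses equal to i among items mapped to j.

    Indices run from 1 to R; row 0 and column 0 are unused padding.
    """
    counts = [[0] * (R + 1) for _ in range(R + 1)]
    totals = [0] * (R + 1)
    for item, scores in responses.items():
        j = f[item]
        for i in scores:
            counts[i][j] += 1
        totals[j] += len(scores)
    return [[counts[i][j] / totals[j] if totals[j] else 0.0
             for j in range(R + 1)] for i in range(R + 1)]

# Hypothetical responses (m = 3, R = 3) and a candidate mapping f.
responses = {"A": [3, 3, 2], "B": [3, 2, 2], "C": [1, 1, 2]}
f = {"A": 3, "B": 3, "C": 1}
p = rating_params(responses, f, R=3)
```

Each column j with at least one mapped item sums to 1, as expected of a conditional distribution over worker scores.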

4.2 Algorithm 

Now, we generalize our idea of bucketized, dominance-consistent
mappings from Section 3.2 to find a maximum likelihood solution
for the rating problem. Although we primarily present the intuition
below, we formalize our dominance relation and consistent
mappings in Section 4.4 and further prove some interesting properties.

Bucketizing. For every item, we are given m worker responses,
each in 1, 2, ..., R. It can be shown that there are $\binom{R + m - 1}{R - 1}$
different possible worker response sets, or buckets. The bucketizing
idea is the same as before: items with the same response sets can
be treated identically and should be mapped to the same values. So
we only consider mappings that give the same rating score to all
items in a common response set bucket.
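Bucketizing for ratings works as in filtering, with a response set of length R instead of 2. The following sketch is illustrative only: the helper names and the toy responses are assumptions, and the bucket count uses the standard multiset-counting formula $\binom{R + m - 1}{R - 1}$.

```python
from collections import defaultdict
from math import comb

def rating_bucket(scores, R):
    """Response set (v_R, ..., v_1) of one item's scores, highest score first."""
    return tuple(scores.count(r) for r in range(R, 0, -1))

def bucketize_ratings(responses, R):
    """Group items that received the same response set."""
    buckets = defaultdict(list)
    for item, scores in responses.items():
        buckets[rating_bucket(scores, R)].append(item)
    return dict(buckets)

R, m = 3, 3
# Hypothetical responses: three scores per item.
responses = {"A": [3, 3, 2], "B": [2, 3, 3], "C": [1, 3, 3], "D": [1, 2, 3]}
buckets = bucketize_ratings(responses, R)

# Number of possible buckets for m responses over R scores.
num_possible_buckets = comb(R + m - 1, R - 1)
```

For R = 3 and m = 3 there are 10 possible buckets, so the bucket-level search space is far smaller than the item-level one whenever the number of items is large.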

Dominance Ordering. Next we generalize our dominance constraint.
Recall that for filtering with m responses per item, we had
a total ordering on the dominance relation over response set buckets,
(m, 0) > (m - 1, 1) > ... > (1, m - 1) > (0, m), where no
dominated bucket could have a higher score (“1”) than a dominating
bucket (“0”). Let us consider the simple example where R = 3
and we have m = 3 worker responses per item. Let (i, j, k) denote
the response set where i workers give a score of “3”, j workers
give a score of “2” and k workers give a score of “1”. Since we
have 3 responses per item, i + j + k = 3. Intuitively, the response
set (3,0,0) dominates the response set (2,1,0) because in the first,
three workers gave items a score of “3”, while in the second, only
two workers give a score of “3” while one gives a score of “2”.
Assuming a “reasonable” worker behavior, we would expect the
value assigned to the dominating bucket to be at least as high as the
value assigned to the dominated bucket. Now consider the buckets
(2,0,1) and (1,2,0). For items in the first bucket, two workers
have given a score of “3”, while one worker has given a score of
“1”. For items in the second bucket, one worker has given a score
of “3”, while two workers have given a score of “2”. Based solely
on these scores, we cannot claim that either of these buckets
dominates the other. So, for the rating problem we only have a partial
dominance ordering, which we can represent as a DAG. We show
the dominance-DAG for the R = 3, m = 3 case in Figure 5. For
arbitrary m, R, we can define the following dominance relation.

Figure 5: Dominance-DAG for 3 workers and scores in {1, 2, 3}

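Definition 4.1 (Rating Dominance). Bucket $B_1$ with response
set $(v_R^1, v_{R-1}^1, \ldots, v_1^1)$ dominates bucket $B_2$ with response
set $(v_R^2, v_{R-1}^2, \ldots, v_1^2)$ if and only if $\exists\, 1 < r' \le R$ such that
$(v_r^1 = v_r^2\ \forall r \notin \{r', r' - 1\})$ and
$(v_{r'}^1 = v_{r'}^2 + 1) \wedge (v_{r'-1}^1 = v_{r'-1}^2 - 1)$.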

Intuitively, a bucket $B_1$ dominates $B_2$ if increasing the score given
by a single worker to $B_2$ by 1 makes its response set equal to that of
items in $B_1$. Note that a bucket can dominate multiple buckets, and
a dominated bucket can have multiple dominating buckets, depending
on which worker’s response is increased by 1. For instance, in
Figure 5, bucket (2,1,0) dominates both (2,0,1) (increase one
score from “1” to “2”) and (1,2,0) (increase one score from “2” to
“3”), both of which dominate (1,1,1).
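The dominance relation and the resulting DAG can be sketched as follows. This is an illustration under assumptions: the names `response_sets` and `dominates` are hypothetical, buckets are tuples $(v_R, \ldots, v_1)$ with the highest score first, and only direct (single-step) dominance is encoded, as in Definition 4.1.

```python
from itertools import combinations_with_replacement

def response_sets(m, R):
    """All buckets (v_R, ..., v_1) with v_R + ... + v_1 = m, best first."""
    sets = {tuple(scores.count(r) for r in range(R, 0, -1))
            for scores in combinations_with_replacement(range(1, R + 1), m)}
    return sorted(sets, reverse=True)

def dominates(b1, b2):
    """True if raising one response in b2 by one score yields b1."""
    diff = [x - y for x, y in zip(b1, b2)]
    changed = [k for k, d in enumerate(diff) if d != 0]
    # Exactly one response moves from score r' - 1 (index k + 1) to r' (index k).
    return (len(changed) == 2
            and changed[1] == changed[0] + 1
            and diff[changed[0]] == 1
            and diff[changed[1]] == -1)

# Dominance-DAG for m = 3 responses and scores in {1, 2, 3}.
nodes = response_sets(3, 3)
edges = [(a, b) for a in nodes for b in nodes if dominates(a, b)]
```

Pairs such as (2,0,1) and (1,2,0) are connected in neither direction, which is what makes the ordering partial rather than total.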

Dominance-Consistent Mappings. As with filtering, we consider
the set of mappings satisfying both the bucketizing and dominance
constraints, and call them dominance-consistent mappings.

Dominance-consistent mappings can be represented using cuts
in the dominance-DAG. To construct a dominance-consistent mapping,
we split the DAG into at most R partitions such that no parent
node belongs to an intuitively “lower” partition than its children.
Then we assign ratings to items in a top-down fashion such that all
nodes within a partition get a common rating value lower than the
value assigned to the partition just above it. Figure 5 shows one
such dominance-consistent mapping corresponding to a set of cuts.
A cut with label c = i essentially partitions the DAG into two sets:
the set of nodes above all receive ratings ≥ i while all nodes below
receive ratings < i. To find the most likely mapping, we sort
the items into buckets and look for mappings over buckets that are
consistent with the dominance-DAG. We use an iterative top-down
approach to enumerate all consistent mappings. First, we label our
nodes in the DAG from 1 ... $\binom{R + m - 1}{R - 1}$ according to their
topological ordering, with the root node starting at 1. In the i-th iteration,
we assume we have the set of all possible consistent mappings
assigning values to nodes 1 ... i - 1 and extend them to all consistent
mappings over nodes 1 ... i. When the last (that is, the
$\binom{R + m - 1}{R - 1}$-th) node has
been added, we are left with the complete set of all
dominance-consistent mappings.


Algorithm 2 Dominance-Consistent Mappings

    I := Input item set
    M := Input evidence matrix
    F := {f_0}           {dominance-consistent mappings; f_0 is the empty mapping}
    p := {}              {worker matrices corresponding to mappings}
    Likelihood := {}     {likelihoods corresponding to mappings}
    Construct (V, E) = Dominance-DAG
    {Enumerating consistent mappings}
    for v in BFS(V) do                       {expand dominance-DAG by BFS}
        for f in F do                        {expand old mappings to include v}
            lower := 1
            upper := min_{v' in parents(v)} f(v')     {upper := R if v is the root}
            for i in lower to upper do
                f_new[i] := f U {v = i}
                F.add(f_new[i])              {add new mappings corresponding to the
                                              dominance-consistent possible values for v}
            end for
            Delete f                         {delete old mappings that only mapped
                                              nodes 1, 2, ..., v - 1}
        end for
    end for
    for f in F do
        p[f] := Params(f, M)
        Likelihood[f] := Pr(M | f, p[f])
    end for
    f* := argmax_f Likelihood[f]
    return (f*, p[f*])
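The enumeration loop of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the names `buckets`, `parents`, and `consistent_mappings` are hypothetical, buckets are assumed to be tuples $(v_R, \ldots, v_1)$, and each bucket is assumed to take any rating from 1 up to the smallest rating among its already-assigned parents (R for the root). The likelihood-scoring step of the algorithm is omitted.

```python
from itertools import combinations_with_replacement


def buckets(R, m):
    """All response sets (v_R, ..., v_1) for m responses over ratings 1..R."""
    return [tuple(scores.count(r) for r in range(R, 0, -1))
            for scores in combinations_with_replacement(range(1, R + 1), m)]


def parents(node):
    """Buckets that directly dominate `node`: raise one worker's score by 1."""
    result = []
    for pos in range(1, len(node)):
        if node[pos] > 0:
            up = list(node)
            up[pos] -= 1
            up[pos - 1] += 1
            result.append(tuple(up))
    return result


def consistent_mappings(R, m):
    """Enumerate every dominance-consistent mapping from buckets to ratings."""
    # Sorting by total score (descending) visits parents before children,
    # because a parent always has a total score exactly one higher.
    def total_score(bucket):
        return sum(count * r for count, r in zip(bucket, range(R, 0, -1)))

    order = sorted(buckets(R, m), key=total_score, reverse=True)
    mappings = [{}]                          # start from the empty mapping
    for v in order:
        extended = []
        for f in mappings:
            upper = min((f[p] for p in parents(v)), default=R)
            for rating in range(1, upper + 1):
                extended.append({**f, v: rating})
        mappings = extended                  # old partial mappings are discarded
    return mappings
```

Under these assumptions, `len(consistent_mappings(3, 3))` evaluates to 126, which agrees with the Dom-Consistent entry for R = 3, m = 3 in Table 2.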


As with the filtering problem, we can show that an exhaustive
search of the dominance-consistent mappings under this dominance
DAG constraint gives us a global maximum likelihood mapping
across a much larger space of reasonable mappings. Suppose we
have n items, R rating values, and m worker responses per item.
The number of buckets of possible worker response sets (nodes in
the DAG) is $\binom{R+m-1}{R-1}$. Then, the number of unconstrained
mappings is $R^n$ and the number of mappings with just the bucketizing
condition, that is, where items with the same response sets get
assigned the same value, is $R^{\binom{R+m-1}{R-1}}$. We enumerate a sample set
of values in Table 2 for n = 100 items. We see that the number
of dominance-consistent mappings is significantly smaller than
the number of unconstrained mappings. The fact that this greatly


R   m   Unconstrained   Bucketized          Dom-Consistent
3   3   $10^{47}$       $6 \times 10^{4}$   126
3   4   $10^{47}$       $10^{7}$            462
3   5   $10^{47}$       $10^{10}$           1716
4   3   $10^{60}$       $10^{12}$           $2.8 \times 10^{4}$
4   4   $10^{60}$       $10^{21}$           $2.7 \times 10^{6}$
5   2   $10^{69}$       $10^{10}$           $2.8 \times 10^{4}$
5   3   $10^{69}$       $10^{24}$           $1.1 \times 10^{?}$

Table 2: Number of Mappings for n = 100 items


reduced set of intuitive mappings contains a global maximum likelihood
solution displays the power of our approach. Furthermore,
the number of items may be much larger, which would make the
number of unconstrained mappings exponentially larger.
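The two closed-form counts discussed here are easy to evaluate directly. The sketch below assumes the bucket count is the number of multisets of size m over R ratings, $\binom{R+m-1}{R-1}$; the helper name `mapping_counts` is hypothetical.

```python
from math import comb


def mapping_counts(R, m, n):
    """Sizes of the search spaces for n items, R ratings, m responses per item."""
    num_buckets = comb(R + m - 1, R - 1)      # nodes in the dominance-DAG
    return {
        "buckets": num_buckets,
        "unconstrained": R ** n,              # every item rated independently
        "bucketized": R ** num_buckets,       # one rating per bucket
    }
```

For R = 3, m = 3, n = 100 this gives 10 buckets and 3^10 = 59049 bucketized mappings (about $6 \times 10^4$), while the unconstrained count 3^100 lies between $10^{47}$ and $10^{48}$. The number of dominance-consistent mappings has no such simple expression here and is obtained by enumeration.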


4.3 Experiments 

We perform experiments using simulated workers and synthetic
data for the rating problem using a setup similar to that described in
Section 3.5.1. Since our results and conclusions are similar to those
in the filtering section, we show the results from one representative
experiment and refer interested readers to the appendix, Section
B.1, for further results.


Setup. We use 1000 items equally distributed across true ratings of
{1, 2, 3} (R = 3). We randomly generate worker response probability
matrices and simulate worker responses for each item to generate
one response set M. We plot and compare various quality
metrics of interest, for instance, the likelihood of mappings and
quality of predicted item ratings along the y-axis, and vary the number
of worker responses per item, m, along the x-axis. Each data
point in our plots corresponds to the outputs of the corresponding
algorithms averaged across 100 randomly generated response sets,
that is, 100 different Ms. The initializations of the EM algorithms
correspond to the worker response probability matrices

$$
EM(1) = \begin{bmatrix} 0.6 & 0.33 & 0.07 \\ 0.33 & 0.34 & 0.33 \\ 0.07 & 0.33 & 0.6 \end{bmatrix}, \quad
EM(2) = \begin{bmatrix} 0.34 & 0.33 & 0.33 \\ 0.33 & 0.34 & 0.33 \\ 0.33 & 0.33 & 0.34 \end{bmatrix}, \quad
EM(3) = \begin{bmatrix} 0.07 & 0.33 & 0.6 \\ 0.33 & 0.34 & 0.33 \\ 0.6 & 0.33 & 0.07 \end{bmatrix}.
$$

Intuitively, EM(1) starts by assuming that
workers have low error rates, EM(2) assumes that workers answer
questions uniformly at random, and EM(3) assumes that workers
have high (adversarial) error rates. As in Section 3.5.1, EM(*)
picks the most likely from the three different EM instances for each
response set M.
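The simulation step of this setup can be sketched as below. It is an illustration under stated assumptions, not the authors' generator: `simulate_response_sets` is a hypothetical name, `p[t][r]` is assumed to hold the probability that a worker answers rating r + 1 when the true rating is t + 1 (so each `p[t]` sums to 1), and each item's responses are summarized as a count tuple $(v_R, \ldots, v_1)$.

```python
import random


def simulate_response_sets(true_ratings, p, m, rng):
    """Draw m independent worker responses per item and bucketize them.

    true_ratings: list of true ratings in 1..R, one per item.
    p: list of R probability distributions, one per true rating.
    rng: a random.Random instance, passed in for reproducibility.
    """
    R = len(p)
    response_sets = []
    for truth in true_ratings:
        answers = rng.choices(range(1, R + 1), weights=p[truth - 1], k=m)
        response_sets.append(tuple(answers.count(r) for r in range(R, 0, -1)))
    return response_sets
```

A call such as `simulate_response_sets([1, 2, 3] * 5, p, 3, random.Random(0))` yields fifteen count tuples, each summing to 3.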


Likelihood. Figure 6 plots the likelihoods (on a natural log scale)
of the mappings output by different algorithms along the y-axis.
We observe that the likelihoods of the mappings returned by our
algorithm, OPT, are significantly higher than those of any of the
EM algorithms. For example, consider m = 5: we observe that our
algorithm finds mappings that are on average 9 orders of magnitude
more likely than those returned by EM(*) (in this case the best
EM instance). As with filtering, the gap between the performance
of our algorithm and the EM instances decreases as the number of
workers increases.


Quality of item rating predictions. In Figure 7 we compare the
predicted ratings of items against the true ratings used to generate
each data point. We measure a weighted score based on how far the
predicted value is from the true value: a correct prediction incurs no
penalty, a predicted rating that is ±1 of the true rating of the item
incurs a penalty of 1, and a predicted rating that is ±2 of the true
rating of the item incurs a penalty of 2. We normalize the final score
by the number of items in the dataset. For our example, each item
can result in a maximum penalty of 2; therefore, the computed score
is in [0, 2], with a lower score implying more accurate predictions.
Again, we observe that in spite of optimizing for likelihood and providing a
global maximum likelihood guarantee, our algorithm predicts item
ratings with a high degree of accuracy.

Figure 7: Item prediction (distance weighted), R = 3
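The distance-weighted score described above amounts to a mean absolute rating error. A minimal sketch, assuming the penalty keeps growing linearly with the rating gap for R > 3 (the text only spells out gaps of 1 and 2); the function name is hypothetical.

```python
def distance_weighted_score(predicted, truth):
    """Average penalty per item, where the penalty is |predicted - true rating|."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)
```

With R = 3 the result always falls in [0, 2]: a perfect prediction scores 0, and swapping every rating 1 with rating 3 scores 2.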

Comparing these results to those in Section 3.5.1, our gains for
rating are significantly higher because the number of parameters 
being estimated is much higher, and the EM algorithm has more 
“ways” it can go wrong if the parameters are initialized incorrectly. 
That is, with a higher dimensionality we expect that EM converges 
more often to non-optimal local maxima. 

We observe that in spite of being optimized for likelihood, our 
algorithm performs well, often beating EM, for different metrics 
of comparison on the predicted item ratings and worker response 
probability matrices. 


4.4 Formalizing dominance 

In this section, we formalize our dominance relation and prove
that it is in fact a partial order, and more specifically, a lattice. Let
$\mathcal{V}$ be the set of all possible item response sets. Recall from
Section 4.2 that $|\mathcal{V}| = \binom{R+m-1}{R-1}$. Definition 4.1 defines the notion
of one response set just dominating, or covering, another response
set. If response set $V_1$ covers response set $V_2$ under Definition 4.1,
we write $V_1 \succ V_2$. We extend that definition to include transitive
dominance below.


Definition 4.2 (Transitive dominance). Let $I_1, I_2, I_3$
be any three items with response sets $M(I_1) = V_1$, $M(I_2) = V_2$,
$M(I_3) = V_3$. We define the transitive dominance relation ($\succeq_t$)
on sets $\mathcal{I}$ and $\mathcal{V}$ as follows:

1. $I \succeq_t I\ \forall I \in \mathcal{I}$ and $V \succeq_t V\ \forall V \in \mathcal{V}$.

2. If $I_1 \succ I_2$, then $I_1 \succeq_t I_2$. Similarly, $V_1 \succ V_2 \Rightarrow V_1 \succeq_t V_2$.

3. If $I_1 \succeq_t I_2 \wedge I_2 \succeq_t I_3$, then $I_1 \succeq_t I_3$. Similarly,
   $V_1 \succeq_t V_2 \wedge V_2 \succeq_t V_3 \Rightarrow V_1 \succeq_t V_3$.

If $I_1 \succeq_t I_2 \wedge I_1 \neq I_2$, we write $I_1 \succ_t I_2$. Intuitively, the transitive
dominance relation constitutes the transitive closure of the
dominance relation.


Definition 4.3 (Partial Ordering). Let $\geq$ be a binary
relation on set S. We say that $\geq$ defines a partial order on S if the
following are satisfied for all $x, y, z \in S$:

1. Reflexivity: $x \geq x$.

2. Antisymmetry: If $x \geq y$ and $y \geq x$, then $x = y$.

3. Transitivity: If $x \geq y$ and $y \geq z$, then $x \geq z$.

We show below that our dominance relation imposes a partial
ordering on the set of possible item response sets. To do so, we
first introduce the idea of a cumulative distribution, and use it to
characterize our transitive dominance relation.

Lemma 4.2 (Cumulative Distribution). Let $A = (a_R, a_{R-1}, \ldots, a_1)$,
$B = (b_R, b_{R-1}, \ldots, b_1) \in \mathcal{V}$ be any two realizations
such that $A \succeq_t B$. Let $Cum(A) = (A_R, A_{R-1}, \ldots, A_1)$
and $Cum(B) = (B_R, B_{R-1}, \ldots, B_1)$ be their cumulative distribution
functions, where $A_j = \sum_{i=1}^{j} a_i$ and $B_j = \sum_{i=1}^{j} b_i$. Then,
$A_i \geq B_i\ \forall i \in 1$ to $R$.


Proof. Let $A = X_1 \succ X_2 \succ \cdots \succ X_k = B$ be a sequence of
realizations, each just dominating (or covering) the next. Intuitively, to
move from $X_j = (x_{j,1}, x_{j,2}, \ldots, x_{j,R})$ to
$X_{j+1} = (x_{j+1,1}, x_{j+1,2}, \ldots, x_{j+1,R})$, we need to shift one vote from some
bucket k to k + 1. That is, $x_{j,k} \rightarrow x_{j,k} - 1$ and $x_{j,k+1} \rightarrow x_{j,k+1} + 1$
(follows from Definition 4.1). The path from $A = X_1$ to $B = X_k$
can be represented by a sequence of such unit vote moves towards
higher ratings. Let the total number of votes shifted from bucket i
to bucket i + 1 in the entire path $\langle X_1, X_k \rangle$ be $\delta_i$. Then,
$B = (b_R, b_{R-1}, \ldots, b_1) = (a_R - \delta_R + \delta_{R-1}, \ldots, a_3 - \delta_3 + \delta_2, a_2 - \delta_2 + \delta_1, a_1 - \delta_1)$.
Note that we are constrained by $\delta_R = 0$ (cannot move
any further than the highest bucket, R) and $0 \leq \delta_i \leq a_i + \delta_{i-1}\ \forall i < R$.
Now, it is easy to verify that $Cum(B) = (B_R, B_{R-1}, \ldots, B_1) =
(A_R - \delta_R, A_{R-1} - \delta_{R-1}, \ldots, A_1 - \delta_1)$. Since $\delta_i \geq 0\ \forall i$, we have
$A_i = B_i + \delta_i \Rightarrow A_i \geq B_i\ \forall i$. □

Lemma 4.2 gives us a way to represent descendants in our dominance
ordering using the cumulative distribution function. We use
this idea to prove that our dominance relation is a partial order on
the set of realizations, and more specifically, a lattice.

Lemma 4.3 (Partial Order). The relation $\succeq_t$ on the set
of items $\mathcal{I}$ or the set of response sets $\mathcal{V}$ defines a partial ordering
on the respective domains.

Proof. We show that $(\mathcal{V}, \succeq_t)$ is a partial order. From Definition
4.2 we have $V \succeq_t V\ \forall V \in \mathcal{V}$. So, our dominance relation $\succeq_t$ is
reflexive.

Let $A \succeq_t B$ and $B \succeq_t A$ for some $A = (a_R, a_{R-1}, \ldots, a_1)$,
$B = (b_R, b_{R-1}, \ldots, b_1) \in \mathcal{V}$. Consider the cumulative distribution
functions, $Cum(A) = (A_R, A_{R-1}, \ldots, A_1)$ and $Cum(B) =
(B_R, B_{R-1}, \ldots, B_1)$, where $A_j = \sum_{i=1}^{j} a_i$ and $B_j = \sum_{i=1}^{j} b_i$.
From Lemma 4.2 we have $A \succeq_t B \Rightarrow A_i \geq B_i\ \forall i$. Similarly,
$B \succeq_t A \Rightarrow B_i \geq A_i\ \forall i$. Combining, we have $A_i = B_i\ \forall i \Rightarrow a_i = b_i\ \forall i$.
Therefore, $A = B$ and $\succeq_t$ is antisymmetric.

From Definition 4.2 we have $V_1 \succeq_t V_2 \wedge V_2 \succeq_t V_3 \Rightarrow V_1 \succeq_t V_3\ \forall V_1, V_2, V_3 \in \mathcal{V}$.
So, $\succeq_t$ is transitive.

Therefore, the relation $\succeq_t$ is a partial order. □

We further show that the partial order imposed by our dominance
relation, $\succeq_t$, is in fact a lattice. A lattice can be defined as follows.

Definition 4.4 (Lattice). A partially ordered set $(\mathcal{V}, \succeq)$
is a lattice if it satisfies the following properties:

1. $\mathcal{V}$ is finite.

2. There exists a maximum element $V^* \in \mathcal{V}$ such that
   $V^* \succeq V\ \forall V \in \mathcal{V}$.

3. Every pair of elements has a greatest lower bound (meet),
   that is, $\forall V_1, V_2 \in \mathcal{V}\ \exists V' \in \mathcal{V}$ such that $(V_1, V_2 \succeq V')$
   and there is no $V \in \mathcal{V}$ with $V_1, V_2 \succeq V \succ V'$.

We now show that our partial ordering $\succeq_t$ on the set of
realizations $\mathcal{V}$ is a lattice.


Theorem 4.1 (Lattice Proof). The dominance partial ordering
$(\mathcal{V}, \succeq_t)$ is a lattice.

Proof. It is easy to see that our set of realizations is finite
($|\mathcal{V}| = \binom{R+m-1}{R-1}$). Next, consider the element $V^* = (m, 0, \ldots, 0)$. We
have $V^* \succ_t V\ \forall V \in \mathcal{V} \setminus \{V^*\}$. Therefore, all that remains to be
shown is that every pair of realizations has a unique greatest lower
bound.

Let $A = (a_R, a_{R-1}, \ldots, a_1)$, $B = (b_R, b_{R-1}, \ldots, b_1) \in \mathcal{V}$ be
any two realizations with cumulative distribution functions $Cum(A) =
(A_R, A_{R-1}, \ldots, A_1)$ and $Cum(B) = (B_R, B_{R-1}, \ldots, B_1)$, where
$A_j = \sum_{i=1}^{j} a_i$ and $B_j = \sum_{i=1}^{j} b_i$. Let $D = (d_R, d_{R-1}, \ldots, d_1)$ be any
common descendant (or lower bound) of A, B, that is, $A, B \succeq_t D$.
Let $Cum(D) = (D_R, D_{R-1}, \ldots, D_1)$. From Lemma 4.2 it follows
that we can find $\delta = (\delta_R, \ldots, \delta_1)$ and $\delta' = (\delta'_R, \ldots, \delta'_1)$ such
that $Cum(D) = (A_R - \delta_R, \ldots, A_1 - \delta_1) = (B_R - \delta'_R, \ldots, B_1 - \delta'_1)$,
where $0 \leq \delta_i \leq A_i$, $0 \leq \delta'_i \leq B_i\ \forall i < R$, and $\delta_R = \delta'_R = 0$.

Choose $\delta^*_i = \max(0, A_i - B_i)$ and $\delta'^*_i = \max(0, B_i - A_i)$.
That is, if $A_i \leq B_i$, we have $\delta^*_i = 0$, $\delta'^*_i = B_i - A_i$, and if
$B_i < A_i$, $\delta^*_i = A_i - B_i$, $\delta'^*_i = 0$. Let $D^*$ be the lower bound to
A, B constructed from $\delta^*, \delta'^*$ such that $Cum(D^*) = Cum(A) - \delta^* =
Cum(B) - \delta'^*$ and $0 \leq \delta^*_i \leq A_i$, $0 \leq \delta'^*_i \leq B_i\ \forall i < R$,
and $\delta^*_R = \delta'^*_R = 0$. So, $D^*$ is a common lower bound of A, B.
We claim that $D^*$ is in fact the greatest lower bound of A, B. We
prove our claim in two steps.

First, let D be any strict upper bound or ancestor of $D^*$. We
show that D cannot be a common lower bound to A, B. From
Lemma 4.2 we have $Cum(D) = Cum(D^*) + \Delta$ where $\Delta =
(\Delta_R, \ldots, \Delta_1)$, such that $\Delta_i \geq 0\ \forall i$ and $\Delta_k > 0$ for some k.
Now $D_k = D^*_k + \Delta_k$. From our construction of $D^*$, we have
$D^*_k = A_k - \delta^*_k = B_k - \delta'^*_k$ where one of $\{\delta^*_k, \delta'^*_k\}$ is 0. Without
loss of generality, suppose $\delta^*_k = 0$. Then, $D_k = A_k + \Delta_k$. Since
$D_k > A_k$, by Lemma 4.2, $A \not\succeq_t D$.

Second, let $D \neq D^*$ be any lower bound to A, B. We show that
$\exists D' \succ_t D$ such that $D'$ is also a lower bound to A, B. Let D be
the lower bound constructed from $\delta, \delta'$, that is, $D_i = A_i - \delta_i =
B_i - \delta'_i\ \forall i$. Now, $\exists k$ such that $\delta_k, \delta'_k > 0$ (otherwise $D = D^*$). Construct
$D'$ such that $D'_i = D_i\ \forall i \neq k$ and $D'_k = D_k + \min(\delta_k, \delta'_k)$. It is
easy to verify that $D' \succ_t D$ and $D'$ is a lower bound of A, B.

Combining the facts that (a) $D^*$ is a lower bound of A, B with
no ancestor that is also a lower bound of A, B, and (b) any other
lower bound of A, B can be shown to have an ancestor that is also a
lower bound of A, B, we have that $D^*$ is the greatest lower bound of A, B.
This completes our proof. □
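The cumulative-count idea behind Lemma 4.2 and the meet construction can be made concrete with a short Python sketch. The indexing here is an assumption chosen for illustration and is not taken from the proofs above: with response sets written as $(v_R, \ldots, v_1)$, the sketch tracks, for each score r, how many responses are at r or higher, so that under Definition 4.1 a (transitively) dominating response set has componentwise larger running totals. The names `upper_counts`, `dominates`, and `meet` are hypothetical.

```python
def upper_counts(bucket):
    """For bucket = (v_R, ..., v_1), return (U_R, ..., U_1) where U_r is the
    number of responses with score r or higher."""
    totals, running = [], 0
    for count in bucket:            # walks from score R down to score 1
        running += count
        totals.append(running)
    return tuple(totals)


def dominates(a, b):
    """True if a transitively dominates b (or equals it)."""
    return all(x >= y for x, y in zip(upper_counts(a), upper_counts(b)))


def meet(a, b):
    """Greatest lower bound of a and b: take the componentwise minimum of the
    running totals, then undo the running sum."""
    lowest = [min(x, y) for x, y in zip(upper_counts(a), upper_counts(b))]
    return tuple(cur - prev for prev, cur in zip([0] + lowest[:-1], lowest))
```

For the incomparable buckets (2,0,1) and (1,2,0) from Figure 5, `meet` returns (1,1,1), the bucket that both of them dominate.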


Next, we formally define dominance-consistent mappings that 
are consistent with the above intuition. 

Definition 4.5 (Dominance-Consistent Mapping). We
call a function $f \in F : \mathcal{I} \rightarrow [1, R]$ a dominance-consistent
mapping if it satisfies the following properties:

1. Let $I_1, I_2 \in \mathcal{I}$ be any two items. If $M(I_1) = M(I_2)$, then
   $f(I_1) = f(I_2)$.

2. Let $V_1, V_2 \in \mathcal{V}$ be such that $M(I_1) = V_1 \succeq_t V_2 = M(I_2)$. Then,
   $f(I_1) \geq f(I_2)$.

We denote the set of all dominance-consistent mappings by $F^* \subset F$.

Intuitively, the first property ensures that a consistent mapping
assigns the same bucket to items with the same observed response
sets. The second property states that the mapping is consistent
with the transitive dominance relation, that is, an item with a better
response set is mapped to at least as high a bucket as an item
with a worse response set. Note that it is crucial to use the transitive
dominance relation, and not just the dominance relation, when
defining consistent mappings to preserve our intuition. Otherwise,
consider an example where there exist two items $I_1, I_2 \in \mathcal{I}$ such
that $M(I_1) \succ_t M(I_2)$ and yet $\nexists I \in \mathcal{I}$ such that $M(I_1) \succ M(I)$.
It would then be possible to construct a consistent mapping, f, with
$f(I_1) < f(I_2)$, which would violate the intuition behind consistent
mappings.

5. EXTENSIONS 

In this section we discuss the generalization of our bucketizing
and dominance-based approach to some extensions of the filtering
and rating problems. Recall our two major assumptions: (1) every
item receives the same number (m) of responses, and (2) all workers
are randomly assigned and their responses are drawn from a
common distribution, p(i, j). We now relax each of these requirements
and describe how our framework can be applied.

5.1 Variable number of responses 

Suppose different items may receive different numbers of worker
responses, e.g., because items are randomly chosen, or workers
choose some questions preferentially over others. Note that in this
section we are still assuming that all workers have the same response
probability matrix p.

For this discussion we restrict ourselves to the filtering problem;
a similar analysis can be applied to rating. Suppose each item can
receive a maximum of m worker responses, with different items
receiving different numbers of responses. Again, we bucketize items
by their response sets and try to impose a dominance ordering on
the buckets. Now, instead of only considering response sets of the
form (m - j, j), we consider arbitrary (i, j). Recall that a response
set (i, j) denotes that an item received i “1” responses and j “0”
responses. We show the imposed dominance ordering in Figure 8.

We expect an item that receives i “1” responses and j “0”
responses to be more likely to have true value “1” than an item with
i - 1 “1” responses and j “0” responses, or an item with i “1”
responses and j + 1 “0” responses. So, we have the dominance
relations $(i, j) > (i-1, j)$ where $i \geq 1$, $j \geq 0$, $i + j \leq m$, and
$(i, j) > (i, j+1)$ with $i, j \geq 0$, $i + j + 1 \leq m$. Note that the
dominance ordering imposed in Section 3, $(m, 0) > (m-1, 1) > \ldots > (0, m)$,
is implied transitively here. For instance,
$(m, 0) > (m-1, 0) \wedge (m-1, 0) > (m-1, 1) \Rightarrow (m, 0) > (m-1, 1)$.
Also note that this is a partial ordering, as certain pairs of buckets,
(0, 0) and (1, 1) for example, cannot intuitively be compared.

Again, we can reduce our search for the maximum likelihood
mapping to the space of all bucketized mappings consistent with
this dominance (partial) ordering. That is, given item set I and
response set M, we consider mappings $f : I \rightarrow \{0, 1\}$, where
$M(I_1) = M(I_2) \Rightarrow f(I_1) = f(I_2)$ and
$M(I_1) > M(I_2) \Rightarrow f(I_1) \geq f(I_2)$. We show two such dominance-consistent
mappings, $f_i$ and $f_j$, in Figure 8. Mapping $f_i$ assigns all items with at
least i “1” worker responses to a value of 1 and the rest to a value
of 0. Similarly, mapping $f_j$ assigns all items with at most j “0”
responses a value of 1 and the rest a value of 0. We can construct
a third dominance-consistent mapping $f_{ij}$ from a conjunction of
these two: $f_{ij}(I) = 1$ if and only if $f_i(I) = 1 \wedge f_j(I) = 1$, that is,
$f_{ij}$ assigns a value of 1 only to those items that have at least i “1” worker
responses and at most j “0” responses. We can now
describe $f_i$ and $f_j$ as special instances of the dominance-consistent
mapping $f_{ij}$ when j = m and i = 0 respectively.

We claim that all dominance-consistent mappings for this setting
can be described as the union of different $f_{ij}$'s for a set of $0 \leq i, j \leq m$,
for a total of $O(2^m)$ dominance-consistent mappings. Note that
although this expression is exponential in the maximum number of
worker responses per item, m, for most practical applications this
is a very small constant. We discuss this statement and describe our
proof for it in the appendix, Section C.1.
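For small m, the dominance-consistent mappings of this setting can be counted by brute force, which is a useful sanity check on the ordering. The sketch below is an illustration only: it assumes the two cover relations stated above, $(i, j) > (i-1, j)$ and $(i, j) > (i, j+1)$, treats every response set with $i + j \leq m$ (including the empty one) as a bucket, and uses hypothetical function names. It is exponential in the number of buckets, so it is only meant for tiny m.

```python
from itertools import product


def response_sets(m):
    """All (i, j) with i '1' responses, j '0' responses, and i + j <= m."""
    return [(i, j) for i in range(m + 1) for j in range(m + 1 - i)]


def cover_pairs(m):
    """Pairs (higher, lower) where `higher` directly dominates `lower`."""
    pairs = []
    for (i, j) in response_sets(m):
        if i >= 1:
            pairs.append(((i, j), (i - 1, j)))      # one fewer '1' response
        if i + j + 1 <= m:
            pairs.append(((i, j), (i, j + 1)))      # one more '0' response
    return pairs


def count_consistent(m):
    """Count 0/1 assignments to buckets that respect every cover pair."""
    nodes = response_sets(m)
    index = {node: k for k, node in enumerate(nodes)}
    edges = [(index[hi], index[lo]) for hi, lo in cover_pairs(m)]
    count = 0
    for values in product((0, 1), repeat=len(nodes)):
        if all(values[hi] >= values[lo] for hi, lo in edges):
            count += 1
    return count
```

Each counted assignment gives dominating buckets a value at least as high as the buckets they dominate, which is the consistency condition used throughout this section.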

5.2 Worker classes 

So far we have assumed that all workers are identical, in that they
draw their answers from the same response probability matrix, a
strong assumption that does not hold in general. Although we could
argue that different worker matrices could be aggregated into one
average probability matrix that our previous approach discovers, if
we have fine-grained knowledge about workers, we would like to
exploit it. In this section we consider the setting where two
classes of workers, expert and regular workers, evaluate the
same set of items. We discuss the generalization to larger numbers
of worker classes below.

We now model worker behavior as two different response probability
matrices, the first corresponding to expert workers who have
low error rates, and the second corresponding to regular workers
who have higher error rates. Our problem now becomes that of
estimating the items’ true values in addition to both of the response
probability matrices. For this discussion, we consider the filtering
problem; a similar analysis can be applied to the rating case.

Again, we extend our ideas of bucketizing and dominance to this
setting. Let $(y_e, n_e, y_r, n_r)$ be the bucket representing all items
that receive $y_e$ and $n_e$ responses of “1” and “0” respectively from
experts, and $y_r$ and $n_r$ responses of “1” and “0” respectively from
regular workers. A dominance partial ordering can be defined using
the following rules. An item (respectively bucket) with response
set $B_1 = (y_e^1, n_e^1, y_r^1, n_r^1)$ dominates an item (respectively bucket)
with response set $B_2 = (y_e^2, n_e^2, y_r^2, n_r^2)$ if and only if one of the
following is satisfied:

• $B_1$ sees more responses of “1” and fewer responses of “0” than
$B_2$. That is, $(y_e^1 \geq y_e^2) \wedge (y_r^1 \geq y_r^2) \wedge (n_e^1 \leq n_e^2) \wedge (n_r^1 \leq n_r^2)$,
where at least one of the inequalities is strict.

• $B_1$ and $B_2$ see the same number of “1” and “0” responses in
total, but more experts respond “1” to $B_1$ and “0” to $B_2$. That
is, $(y_e^1 + y_r^1 = y_e^2 + y_r^2) \wedge (n_e^1 + n_r^1 = n_e^2 + n_r^2) \wedge (y_e^1 \geq y_e^2) \wedge (n_e^1 \leq n_e^2)$,
where at least one of the inequalities is strict.

As before, we consider only the set of mappings that assign all 
items in a bucket the same value while preserving the dominance 
relationship, that is, dominating buckets get at least as high a value 
as dominated buckets. 
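The two dominance rules for expert and regular workers translate directly into a predicate. This is a minimal sketch with a hypothetical function name; buckets are assumed to be 4-tuples $(y_e, n_e, y_r, n_r)$ as defined above.

```python
def dominates_two_class(b1, b2):
    """True if bucket b1 dominates bucket b2 under either of the two rules."""
    ye1, ne1, yr1, nr1 = b1
    ye2, ne2, yr2, nr2 = b2

    # Rule 1: at least as many '1's and at most as many '0's from each class,
    # with at least one strict inequality.
    weak = ye1 >= ye2 and yr1 >= yr2 and ne1 <= ne2 and nr1 <= nr2
    strict = ye1 > ye2 or yr1 > yr2 or ne1 < ne2 or nr1 < nr2
    rule1 = weak and strict

    # Rule 2: same totals of '1' and '0' responses, but the experts lean
    # more towards '1' for b1 (this relies on experts being more accurate).
    same_totals = (ye1 + yr1 == ye2 + yr2) and (ne1 + nr1 == ne2 + nr2)
    rule2 = (same_totals and ye1 >= ye2 and ne1 <= ne2
             and (ye1 > ye2 or ne1 < ne2))

    return rule1 or rule2
```

For example, a bucket where the single expert says “1” and the single regular worker says “0”, (1, 0, 0, 1), dominates the bucket with the roles reversed, (0, 1, 1, 0), by the second rule only.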

Note that the second dominance condition above leverages the
assumption that experts have smaller error probabilities than regular
workers. If we were just given two classes of workers with
no information about their response probability matrices, we could
only use the first dominance condition. In general, having more
information about the error probabilities of worker classes allows
us to construct stronger dominance conditions, which in turn
reduces the number of dominance-consistent mappings. This property
allows our framework to be flexible and adaptable to different
granularities of prior knowledge.

Figure 8: Variable number of responses (axis label: Number of “1” responses)

While this extension is reasonable when the number of distinct
worker classes is small, it is impractical to generalize it to a large
number of classes. One heuristic approach to tackling the problem
of a large number of worker classes, or independent workers, could
be to divide items into a large number of discrete groups and assign a
small distinct set of workers to evaluate each group of items. We
then treat and solve each of the groups independently as a problem
instance with a small number of worker classes. More efficient
algorithms for this setting are a topic for future work.

6. CONCLUSIONS 

We have taken a first step towards finding a global maximum
likelihood solution to the problem of jointly estimating the item
ground truth, and worker quality, in crowdsourced filtering and rating
tasks. Given worker ratings on a set of items (binary in the
case of filtering), we show that the problem of jointly estimating
the ratings of items and worker quality can be split into two
independent problems. We use a few key, intuitive ideas to first find a
global maximum likelihood mapping from items to ratings, thereby
finding the most likely ground truth. We then show that the worker
quality, modeled by a common response probability matrix, can
be inferred automatically from the corresponding maximum likelihood
mapping. We develop a novel pruning and search-based
approach, in which we greatly reduce the space of (originally
exponential) potential mappings to be considered, and prove that an
exhaustive search in the reduced space is guaranteed to return a
maximum likelihood solution.

We performed experiments on real and synthetic data to compare
our algorithm against an Expectation-Maximization based
algorithm. We show that in spite of being optimized for the likelihood
of mappings, our algorithm estimates the ground truth of item
ratings and worker qualities with high accuracy, and performs well
over a number of comparison metrics.

Although we assume throughout most of this paper that all workers
draw their responses independently from a common probability
matrix, we generalize our approach to the cases where different
worker classes draw their responses from different matrices.
Likewise, we assume a fixed number of responses for each item, but
we can generalize to the case where different items may receive
different numbers of responses.

It should be noted that although our framework generalizes to
these extensions, including the case where each worker has an
independent, different quality, the algorithms can be inefficient in
practice. We have not considered the problem of item difficulties in this
paper, assuming that workers have the same quality of responses on
all items. As future work, we hope that the ideas described in this
paper can be built upon to design efficient algorithms that find a
global maximum likelihood mapping under more general settings.

A.1 Experiments


Metrics. Earth mover's distance, or EMD, is a metric function that
captures how similar two probability distributions are. Intuitively,
if the two distributions are represented as piles of sand, EMD is a
measure of the minimum amount of sand that needs to be shifted
to make the two piles equal. In our problem, the worker response
matrix p can be represented as two probability distributions
corresponding to p(i, 1) and p(i, 0), that is, the probability distributions
of worker responses given that the true value of an item is 1 and 0
respectively. We compute the EMD of p(i, 1) from $p_{true}(i, 1)$ and
p(i, 0) from $p_{true}(i, 0)$ and record their sum as the EMD “score” of
the algorithm that predicts p. Since the EMD between p(i, j) and
$p_{true}(i, j)$ lies in [0, 1], our EMD score that sums the individual
EMDs for j = 0, 1 lies in [0, 2].

We also compute a similar score using the Jensen-Shannon divergence
(JSD), which is another standard metric for measuring
the similarity between two probability distributions. We compute
the JSD between p(i, 1) and $p_{true}(i, 1)$, and p(i, 0) and $p_{true}(i, 0)$.
As with the EMD based score, we compute the sum of these two
JSD values and use it as our comparison metric.
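Both scores can be sketched in a few lines of Python. Several choices here are assumptions, since the text does not fix them: the EMD is computed for distributions over the same ordered support with unit distance between adjacent values, the JSD uses base-2 logarithms (so a single JSD lies in [0, 1]), and the names `emd`, `jsd`, and `matrix_score` are hypothetical.

```python
from math import log2


def emd(p, q):
    """Earth mover's distance between two distributions on the same ordered
    support, assuming unit distance between adjacent support points."""
    total, carried = 0.0, 0.0
    for pi, qi in zip(p, q):
        carried += pi - qi          # mass that must still be moved onward
        total += abs(carried)
    return total


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logarithms."""
    def kl(a, b):
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def matrix_score(p_est, p_true, distance):
    """Sum a distance over the per-true-value response distributions."""
    return sum(distance(est, true) for est, true in zip(p_est, p_true))
```

Here `p_est` and `p_true` are assumed to be lists holding one response distribution per true value, so for filtering `matrix_score(p_est, p_true, emd)` sums two EMD values and lies in [0, 2].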


Additional results. We now present some additional experimental
plots comparing our algorithm against the different EM instances
for the metrics of likelihood, EMD based score, JSD based score,
and fraction of items predicted incorrectly, similar to Section 3.5.1.

Figures 9(a) and 9(b) plot the fraction of items predicted incorrectly
and JSD score respectively for a selectivity of 0.5. We see
that both our algorithm and EM perform comparably on these metrics
and give high accuracy. Figure 9(c) plots the JSD score for a
selectivity of 0.7. We observe that here our algorithm outperforms
the aggregated EM(*) algorithm, while EM(1) is comparable.

We also generate synthetic data with the ground truth having a
selectivity of 0.9, that is, 90% of items have a true value of 1 and
10% have a true value of 0. We observe, from Figures 10(b) and
10(c), that our algorithm outperforms EM(*), but does worse
than EM(1) over this highly skewed ground truth. We explain this
effect at a high level with the following intuitive example: suppose
all items had a true value of 1 and workers had a false negative
error rate of 0.2. Then, we expect 80% of all worker responses
to be 1, and the remaining 20% to be 0. Now for this worker
response set, the pair $s = 1, e_1 = 0.2$ is less likely than other less
“extreme” solutions, $s = 0.9, e_1 = 0.1$ for example. As a result,
while EM(1) readily converges to such extreme solution points,
OPT and EM(*) find more likely solutions, which for such highly
skewed instances turn out to be less accurate. In practice it is often
not possible to predict when a given dataset will be skewed or
“extreme”. Given such information, we can tune the EM and OPT
algorithms to account for the skewness and find better solutions.


B. RATING 
B.1 Experiments

Metrics. In this section, we present results over different input
ground truth distributions over the same experimental setup described
in Section 4.3. Recall that there we describe our results
for the case where items are equally divided across the rating values,
that is, one-third each of the items have true ratings 1, 2 and 3
respectively.

Figure 9: Synthetic Data Experiments: (a) Fraction Incorrect, s = 0.5 (b) JSD Score, s = 0.5 (c) JSD Score, s = 0.7

Figure 10: Synthetic Data Experiments: (a) Likelihood, s = 0.9 (b) Fraction Incorrect, s = 0.9 (c) JSD Score, s = 0.9

Figure 11: Synthetic Data Experiments: (a) Likelihood, s = 2 (b) Distance Weighted Score, s = 2 (c) EMD Score, s = 2

Figure 12: Synthetic Data Experiments: (a) Likelihood, s = 3 (b) Distance Weighted Score, s = 3 (c) EMD Score, s = 3


In addition to the distance-weighted item prediction score and
likelihood metrics described in Section 4.3, we also run experiments
on our synthetic data using the EMD based score described
in Section A.1. Note that here the EMD score is based on the sum
of R = 3 pairwise EMD values corresponding to each column of
the 3 x 3 response probability matrix.

Figure 13: Dominance constraint

Figure 14: Dominance constraint


In Figure 11 we plot experiments where 20% of the items have
ground truth rating 1, 60% have ground truth rating 2 and 20% have
ground truth rating 3. We represent this ground truth distribution,
or selectivity vector, by the notation s = 2. In Figure 12 (s = 3) we
plot experiments where 40% of the items have ground truth rating
1, 20% have ground truth rating 2 and 40% have ground truth rating
3.

We observe that for all these experiments, the results are very
similar to those seen in Section 4.3, Figures 6 and 7. We observe
that our algorithm finds more likely mappings (Figures 11(a),
12(a)), predicts item ground truth ratings with higher accuracy
(Figures 11(b), 12(b)) and obtains a better estimate for the worker
response probability matrix (Figures 11(c), 12(c)) than all EM
instances.


C. EXTENSIONS 


C.l Variable number of responses 

In this section, we calculate the number of dominance-consistent 
mappings for the filtering problem where different items can each 
receive a different number of worker responses. Let m be the max¬ 
imum number of worker responses that any item receives. Recall 
from Section [5T) hat we can represent the set of all possible item 
response sets in the dominance-DAG shown in Figure [8] 

Let $f$ be any dominance-consistent mapping under this setting.
Let $(i, j)$ be a response set with $i$ responses of 1 and $j$ responses
of 0 such that $f(i, j) = 1$. Then, by our dominance constraint, we
know that $f(i + \Delta i, j + \Delta j) = 1$ for all $\Delta i \geq 0$, $\Delta j \leq 0$,
that is, for every response set with at least as many responses of 1
and at most as many responses of 0. We represent this figuratively
in Figure 13. If $f(i, j) = 1$, then all response sets, or points, in
the shaded area also get mapped to a value of 1.
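
The shaded region can be sketched in a few lines of Python. The helper `forced_to_one` is hypothetical. It assumes, in line with the description of the boundary that follows, that the points forced to 1 are those to the right of and below $(i, j)$ (at least as many responses of 1, at most as many responses of 0), restricted to response sets with at most $m$ responses in total.

```python
def forced_to_one(i, j, m):
    """Response sets (a, b) that must map to 1 once f(i, j) = 1.

    Assumptions: a counts responses of 1, b counts responses of 0,
    and no item receives more than m responses (a + b <= m).
    """
    return {
        (a, b)
        for a in range(i, m + 1)      # at least as many responses of 1
        for b in range(0, j + 1)      # at most as many responses of 0
        if a + b <= m
    }
```

For example, with $m = 3$ and $f(1, 1) = 1$, the sketch yields the five points (1, 0), (1, 1), (2, 0), (2, 1), and (3, 0).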

Now, by applying the dominance constraint as shown in Figure
13 to every $(i, j)$ such that $f(i, j) = 1$, we can show that $f$ can
be described intuitively by its “boundary”. We demonstrate this
intuition in Figure 14. Every point $(i, j)$ on $f$’s boundary satisfies
$f(i, j) = 1$. Additionally, every point “within” the boundary, that
is, every point that is to the right of and below the boundary, gets
mapped to a value of 1 under $f$. Finally, every point that is not on
or within the boundary gets mapped to a value of 0 under $f$. It
follows from the dominance constraint that every dominance-consistent
mapping $f$ can be represented by a unique such continuous boundary.
Furthermore, it is easy to see that any such boundary satisfies three
conditions:

• Its leftmost (corner) point lies on one of the axes.

• Its topmost (corner) point lies on the line $x + y = m$.

• If $(x_1, y_1)$ and $(x_2, y_2)$ lie on the boundary, then $x_1 < x_2$
implies $y_1 \leq y_2$.

Intuitively, the above three conditions give us a constructive
definition for any dominance-consistent mapping’s boundary. Every
dominance-consistent mapping can be constructed uniquely as
follows: (1) Choose a left corner point lying on one of the axes. (2)
Choose a topmost corner point (necessarily above and to the right
of the first corner point) lying on the line $x + y = m$. (3) Finally,
define the boundary as a unique grid traversal from the left corner
to the top corner where you are allowed to extend the boundary
only to the right or upwards. Each such boundary corresponds to a
unique dominance-consistent mapping where every point on or under
the boundary is mapped to 1 and every other point is mapped to
0. Furthermore, every dominance-consistent mapping has a unique
such boundary.
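
The construction above can be sketched as follows. Both helpers are hypothetical and assume that a boundary is given as a list of grid points ordered from the left corner to the top corner, with $x$ counting responses of 1 and $y$ counting responses of 0. A point is treated as "on or under" the boundary if it lies weakly to the right of and weakly below some boundary point.

```python
def is_valid_boundary(boundary, m):
    """Check the three boundary conditions for an ordered list of grid points."""
    if not boundary:
        return True  # empty boundary: every response set maps to 0
    (x0, y0), (xt, yt) = boundary[0], boundary[-1]
    starts_on_axis = min(x0, y0) == 0          # left corner on one of the axes
    ends_on_line = xt + yt == m                # top corner on x + y = m
    steps_ok = all(                            # only unit steps right or up
        (x2 - x1, y2 - y1) in {(1, 0), (0, 1)}
        for (x1, y1), (x2, y2) in zip(boundary, boundary[1:])
    )
    return starts_on_axis and ends_on_line and steps_ok


def mapping_from_boundary(boundary, m):
    """Map every response set (x, y) with x + y <= m to 0 or 1."""
    return {
        (x, y): int(any(x >= bx and y <= by for bx, by in boundary))
        for x in range(m + 1)
        for y in range(m + 1 - x)
    }
```

For $m = 2$ and the boundary [(1, 0), (1, 1)], the sketch maps (1, 0), (1, 1), and (2, 0) to 1 and the remaining points (0, 0), (0, 1), and (0, 2) to 0.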

Therefore, our problem of counting the number of dominance-consistent
mappings for this setting reduces to counting the number
of such boundaries. We use our constructive definition for the
boundary to compute this number. First, suppose the leftmost corner
point, $L = (p, 0)$, $0 \leq p \leq m$, lies on the x-axis (we can
calculate similarly for $(0, q)$). Now, the topmost corner point lies
to the right of the first corner point, and on the line $x + y = m$.
Therefore, it is of the form $T = (p + z, m - (p + z))$ for some
$0 \leq z \leq m - p$. The number of unique grid traversals (respectively,
boundaries) from $L$ to $T$ is given by $\binom{m-p}{z}$. Combining, we have
that the number of unique boundaries that have their left corner on the
x-axis is

$$\sum_{p=0}^{m} \sum_{z=0}^{m-p} \binom{m-p}{z} = \sum_{p=0}^{m} 2^{m-p} = 2^{m+1} - 1.$$

Calculating similarly for boundaries that start with their leftmost
corner on the y-axis ($(0, q)$, $1 \leq q \leq m$) and including the empty
boundary (corresponding to the mapping where all items get assigned a
value of 0), we get an additional $2^m$ boundaries. Therefore, we conclude
that there are $O(2^m)$ such boundaries, corresponding to $O(2^m)$
dominance-consistent mappings. It should be noted that although
this is exponential in the maximum number of worker responses to
an item, typical values of $m$ are small enough that all mappings can
very easily be enumerated and evaluated.
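
The closed forms in this count can be checked numerically for small $m$. The sketch below uses hypothetical helper names and counts right/up grid traversals with a small dynamic program instead of the binomial formula, so that agreement with $2^{m+1} - 1$ and $2^m$ is an independent check rather than a restatement.

```python
def count_paths(left, top):
    """Number of grid traversals from left to top using only right/up unit steps."""
    dx, dy = top[0] - left[0], top[1] - left[1]
    if dx < 0 or dy < 0:
        return 0
    ways = [1] * (dy + 1)            # first column: reachable by up moves only
    for _ in range(dx):              # sweep one column to the right at a time
        for k in range(1, dy + 1):
            ways[k] += ways[k - 1]
    return ways[dy]


def count_boundaries(m):
    """Return (boundaries with left corner on the x-axis, all remaining boundaries)."""
    x_axis = sum(
        count_paths((p, 0), (p + z, m - (p + z)))
        for p in range(m + 1)
        for z in range(m - p + 1)
    )
    y_axis = sum(
        count_paths((0, q), (z, m - z))
        for q in range(1, m + 1)
        for z in range(m - q + 1)
    )
    return x_axis, y_axis + 1        # + 1 for the empty boundary


for m in range(8):
    x_axis, rest = count_boundaries(m)
    assert x_axis == 2 ** (m + 1) - 1
    assert rest == 2 ** m
```

For $m = 3$, this gives 15 boundaries with their left corner on the x-axis and 8 others, 23 in total, which is small enough to enumerate directly.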

